Correlation and Regression#

Spatial Correlation with Lead-Lag Time Series#

This section explains how to compute grid-wise Pearson correlation between a regional time series (e.g., Niño 3.4 index) and a gridded field (e.g., seasonal anomalies). These functions can be used for lead-lag analysis by adjusting the year slices appropriately.


📌 Method 1: Using GeoCAT (Faster, but May Show Warnings)#

This approach uses geocat.comp.stats.pearson_r. It is fast and robust but may produce harmless warnings when missing values exist in some grid cells. This method replaces the year coordinate with a dummy index to avoid alignment issues during correlation.

import numpy as np
import geocat.comp as gccomp


def lead_lag_correlation_geocat(ts, ystr1, yend1, dat, ystr2, yend2, min_coverage=0.9):
    """
    Compute Pearson correlation between a 1D time series and a 3D gridded field
    using GeoCAT (geocat.comp.stats.pearson_r).
    """
    # Subset the requested year windows for the index and the field
    ts_sel = ts.sel(year=slice(ystr1, yend1))
    dat_sel = dat.sel(year=slice(ystr2, yend2))

    if ts_sel['year'].size != dat_sel['year'].size:
        raise ValueError(f"Year mismatch: ts has {ts_sel['year'].size}, dat has {dat_sel['year'].size}.")

    # Replace the year coordinate with a dummy index so that xarray does not
    # intersect-align the (possibly shifted) year labels during correlation
    dummy_year = np.arange(ts_sel['year'].size)
    ts_sel = ts_sel.assign_coords(year=dummy_year)
    dat_sel = dat_sel.assign_coords(year=dummy_year)

    # Mask grid cells with near-zero variance or too many missing years
    dat_std = dat_sel.std(dim="year", skipna=True)
    valid_counts = dat_sel.count(dim="year")
    min_required_years = int(dat_sel['year'].size * min_coverage)
    invalid_mask = (dat_std < 1e-10) | (valid_counts < min_required_years)
    dat_masked = dat_sel.where(~invalid_mask)

    # Grid-wise Pearson correlation along the (dummy) year dimension
    corr = gccomp.stats.pearson_r(ts_sel, dat_masked, dim="year", skipna=True, keep_attrs=True)
    return corr

📌 Method 2: Using NumPy (Warning-Free but Slower)#

This version uses numpy.corrcoef explicitly and avoids alignment issues and warnings. It is slightly slower for large datasets but more transparent.

import numpy as np
import xarray as xr


def lead_lag_correlation(ts, ystr1, yend1, dat, ystr2, yend2, min_coverage=0.9):
    """
    Compute Pearson correlation between a 1D time series and a 3D gridded field
    using an explicit numpy.corrcoef loop.
    """
    # Pull raw arrays; .values drops the year coordinate, so no alignment occurs
    ts_sel = ts.sel(year=slice(ystr1, yend1)).values
    dat_sel = dat.sel(year=slice(ystr2, yend2)).values

    if ts_sel.shape[0] != dat_sel.shape[0]:
        raise ValueError(f"Mismatch in year length: ts={ts_sel.shape[0]}, dat={dat_sel.shape[0]}.")

    ntime, nlat, nlon = dat_sel.shape

    # Require a minimum fraction of non-missing years at each grid cell
    valid_counts = np.isfinite(dat_sel).sum(axis=0)
    min_required_years = int(ntime * min_coverage)
    sufficient_mask = valid_counts >= min_required_years

    corr_map = np.full((nlat, nlon), np.nan)

    for i in range(nlat):
        for j in range(nlon):
            if sufficient_mask[i, j]:
                x = dat_sel[:, i, j]
                # Use only years where both the index and the field are finite
                mask = np.isfinite(x) & np.isfinite(ts_sel)
                # Require at least three overlapping years for a meaningful r
                if mask.sum() > 2:
                    corr_map[i, j] = np.corrcoef(ts_sel[mask], x[mask])[0, 1]

    corr_da = xr.DataArray(
        corr_map,
        dims=['lat', 'lon'],
        coords={'lat': dat['lat'], 'lon': dat['lon']},
        name='lead_lag_correlation'
    )

    # Redundant with the NaN fill above, but makes the coverage mask explicit
    corr_da = corr_da.where(sufficient_mask)
    return corr_da

⚠️ Note on Time Coordinates#

For lagged correlation, the original year labels must not take part in xarray's coordinate alignment: Method 1 replaces them with a dummy index, and Method 2 drops them entirely via .values. Without this step, xarray would intersect-align the shifted year ranges and silently discard the non-overlapping years between the predictor and the response field.
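
A minimal sketch of the pitfall, using hypothetical three-year series a and b whose year labels are shifted by one:

import numpy as np
import xarray as xr

# Index covers 1952-1954; field window covers 1951-1953 (one-year offset)
a = xr.DataArray([1.0, 2.0, 3.0], coords={'year': [1952, 1953, 1954]}, dims='year')
b = xr.DataArray([4.0, 5.0, 6.0], coords={'year': [1951, 1952, 1953]}, dims='year')

# xarray intersect-aligns on the year labels: only 1952-1953 survive
print((a * b)['year'].values)   # [1952 1953]

# With a dummy index, all three shifted pairs are kept
dummy = np.arange(3)
print((a.assign_coords(year=dummy) * b.assign_coords(year=dummy))['year'].size)  # 3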


Regression Map (Optional Addition)#

Once you calculate a correlation map, you may also want to derive a regression map:

# Regression coefficient: slope = r * std(Y) / std(X)
std_Y = dat.sel(year=slice(ystr2, yend2)).std(dim='year')
std_X = ts.sel(year=slice(ystr1, yend1)).std(dim='year')
reg_map = std_Y * corr / std_X

Here, the regression coefficient reflects the amplitude of the field response per unit change in the index (e.g., Niño 3.4). If anomalies are used and the index is standardized, std_X is close to 1, so the map reduces to r * std(Y): the response per one standard deviation of the index.
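
As a quick sanity check on this formula, a sketch with synthetic data (x stands in for a standardized index, y for one grid cell's anomalies) confirms that the ordinary least-squares slope equals r * std(Y) / std(X):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)             # stand-in for a standardized index
y = 0.8 * x + rng.standard_normal(100)   # stand-in for one grid cell's anomalies

r = np.corrcoef(x, y)[0, 1]
slope_ols = np.polyfit(x, y, 1)[0]       # least-squares slope of y on x
slope_from_r = r * y.std() / x.std()     # r * std(Y) / std(X)

print(np.isclose(slope_ols, slope_from_r))   # True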


This module equips you to perform both correlation and regression analysis for diagnosing climate teleconnections.


📌 Example Usage#

Assume you have a standardized Niño 3.4 index (nino34_ts_std) and a seasonal anomaly field (dat_NDJFM), both on the year dimension:

# Set years for predictor (Niño 3.4) and response (e.g., NDJFM SST anomalies)
ystr_index, yend_index = 1991, 2020
ystr_field, yend_field = 1991, 2020

# Correlation using GeoCAT (fast, may warn on missing values)
corr_map1 = lead_lag_correlation_geocat(nino34_ts_std, ystr_index, yend_index,
                                        dat_NDJFM, ystr_field, yend_field)

# Correlation using NumPy (no warnings, slightly slower)
corr_map2 = lead_lag_correlation(nino34_ts_std, ystr_index, yend_index,
                                 dat_NDJFM, ystr_field, yend_field)

# Optional regression map using standard deviation
nino_std = nino34_ts_std.sel(year=slice(ystr_index, yend_index)).std(dim="year")
std_Y = dat_NDJFM.sel(year=slice(ystr_field, yend_field)).std(dim="year")
reg_map = std_Y * corr_map2 / nino_std

📝 Make sure your index and anomaly field are both aligned on the year coordinate before computing.
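
A minimal sketch for visualizing the resulting correlation map, assuming matplotlib and cartopy are installed in your environment:

import matplotlib.pyplot as plt
import cartopy.crs as ccrs

fig = plt.figure(figsize=(10, 4))
ax = plt.axes(projection=ccrs.PlateCarree(central_longitude=180))
corr_map2.plot(ax=ax, transform=ccrs.PlateCarree(),
               vmin=-1, vmax=1, cmap='RdBu_r',
               cbar_kwargs={'label': 'Pearson r'})
ax.coastlines()
plt.show()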

▶️ Example: One-Year Lag Correlation#

To perform a one-year lagged correlation (e.g., Niño 3.4 from 1952–2020 against NDJFM anomalies from 1951–2019):

corr_m12 = lead_lag_correlation_geocat(
    ts=nino34_ts_std,
    ystr1=1952, yend1=2020,
    dat=dat_NDJFM,
    ystr2=1951, yend2=2019,
    min_coverage=0.9
)

Here each NDJFM field year is paired with the Niño 3.4 value from the following year, so the correlation tests whether the NDJFM field leads the index by one year (equivalently, whether the index lags the field). Both ts and dat must carry a 'year' dimension; the dummy-index logic inside the function takes care of aligning the offset windows.
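
The same pattern extends to multiple lags. Below is a sketch that sweeps the field window over several offsets (the lag_corrs dictionary and the sign convention, with positive lag meaning the field leads the index, are illustrative choices; the field is assumed to cover at least 1989–2022):

# Positive lag: the field window precedes the index window (field leads)
lag_corrs = {}
for lag in range(-2, 3):
    lag_corrs[lag] = lead_lag_correlation(
        nino34_ts_std, 1991, 2020,
        dat_NDJFM, 1991 - lag, 2020 - lag,
    )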


This example illustrates how to explore climate teleconnections using spatial correlation and regression diagnostics.