# Correlation and Regression

## Spatial Correlation with Lead-Lag Time Series

This section explains how to compute grid-wise Pearson correlation between a regional time series (e.g., the Niño 3.4 index) and a gridded field (e.g., seasonal anomalies). Both functions below can be used for lead-lag analysis by shifting the year slices of the index and the field relative to each other.
### 📌 Method 1: Using GeoCAT (Faster, but May Show Warnings)

This approach uses `geocat.comp.stats.pearson_r`. It is fast and robust but may produce harmless warnings when missing values exist in some grid cells. The method replaces the `year` coordinate with a dummy index to avoid alignment issues during correlation.
```python
import numpy as np
import geocat.comp as gccomp


def lead_lag_correlation_geocat(ts, ystr1, yend1, dat, ystr2, yend2, min_coverage=0.9):
    """
    Compute Pearson correlation using GeoCAT with a 1D time series and a 3D gridded field.
    """
    # Subset and align years
    ts_sel = ts.sel(year=slice(ystr1, yend1))
    dat_sel = dat.sel(year=slice(ystr2, yend2))
    if ts_sel['year'].size != dat_sel['year'].size:
        raise ValueError(f"Year mismatch: ts has {ts_sel['year'].size}, dat has {dat_sel['year'].size}.")

    # Replace year with a dummy index so shifted periods pair positionally
    dummy_year = np.arange(ts_sel['year'].size)
    ts_sel = ts_sel.assign_coords(year=dummy_year)
    dat_sel = dat_sel.assign_coords(year=dummy_year)

    # Mask low-variance or insufficiently sampled grid cells
    dat_std = dat_sel.std(dim="year", skipna=True)
    valid_counts = dat_sel.count(dim="year")
    min_required_years = int(dat_sel['year'].size * min_coverage)
    invalid_mask = (dat_std < 1e-10) | (valid_counts < min_required_years)
    dat_masked = dat_sel.where(~invalid_mask)

    # Compute grid-wise Pearson correlation along the dummy 'year' dimension
    corr = gccomp.stats.pearson_r(ts_sel, dat_masked, dim="year", skipna=True, keep_attrs=True)
    return corr
```
### 📌 Method 2: Using NumPy (Warning-Free but Slower)

This version uses `numpy.corrcoef` explicitly, avoiding alignment issues and warnings. It is slightly slower for large datasets but more transparent.
```python
import numpy as np
import xarray as xr


def lead_lag_correlation(ts, ystr1, yend1, dat, ystr2, yend2, min_coverage=0.9):
    """
    Compute Pearson correlation using NumPy with a 1D time series and a 3D gridded field.
    """
    ts_sel = ts.sel(year=slice(ystr1, yend1)).values
    dat_sel = dat.sel(year=slice(ystr2, yend2)).values
    if ts_sel.shape[0] != dat_sel.shape[0]:
        raise ValueError(f"Mismatch in year length: ts={ts_sel.shape[0]}, dat={dat_sel.shape[0]}.")

    ntime, nlat, nlon = dat_sel.shape

    # Require a minimum fraction of valid years at each grid cell
    valid_counts = np.isfinite(dat_sel).sum(axis=0)
    min_required_years = int(ntime * min_coverage)
    sufficient_mask = valid_counts >= min_required_years

    corr_map = np.full((nlat, nlon), np.nan)
    for i in range(nlat):
        for j in range(nlon):
            if sufficient_mask[i, j]:
                x = dat_sel[:, i, j]
                mask = np.isfinite(x) & np.isfinite(ts_sel)
                if mask.sum() > 2:
                    corr_map[i, j] = np.corrcoef(ts_sel[mask], x[mask])[0, 1]

    corr_da = xr.DataArray(
        corr_map,
        dims=['lat', 'lon'],
        coords={'lat': dat['lat'], 'lon': dat['lon']},
        name='lead_lag_correlation'
    )
    corr_da = corr_da.where(sufficient_mask)
    return corr_da
```
### ⚠️ Note on Time Coordinates

For both methods, lag correlation requires dropping the original time coordinate. This is why a dummy `year` index is used: it ensures correct positional alignment of shifted time periods between the predictor and the response field.
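Why the dummy index matters can be seen in a small sketch (toy data; `xarray` aligns binary operations on shared coordinates by default):

```python
import numpy as np
import xarray as xr

# Toy series whose 'year' coordinates are offset by one year
ts = xr.DataArray(np.arange(5.0), coords={"year": np.arange(1952, 1957)}, dims="year")
dat = xr.DataArray(np.arange(5.0), coords={"year": np.arange(1951, 1956)}, dims="year")

# Binary operations align on 'year' (inner join): only overlapping years survive,
# silently pairing 1952 with 1952 instead of the intended 1952-vs-1951 offset
overlap = ts * dat
print(overlap["year"].size)   # 4, not 5

# Replacing 'year' with a dummy positional index pairs the series element-wise
dummy = np.arange(ts["year"].size)
paired = ts.assign_coords(year=dummy) * dat.assign_coords(year=dummy)
print(paired["year"].size)    # 5
```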
### Regression Map (Optional Addition)

Once you calculate a correlation map, you may also want to derive a regression map. The regression slope equals the correlation times the ratio of standard deviations:

```python
# Regress field anomalies onto the index: slope = r * std(field) / std(index)
std_Y = dat.sel(year=slice(ystr2, yend2)).std(dim='year')
reg_map = std_Y * corr / ts.sel(year=slice(ystr1, yend1)).std()
```

Here, the regression coefficient reflects the field's amplitude response per unit change in the index (e.g., Niño 3.4), assuming anomalies have been used. If the index is standardized, its standard deviation is 1 and the regression map reduces to the correlation scaled by the field's standard deviation.
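The identity behind this regression map, slope = r · std(Y)/std(X), can be sanity-checked against an ordinary least-squares fit on synthetic data (all variable names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200)                  # synthetic index
y = 0.5 * x + 0.3 * rng.standard_normal(200)  # synthetic field value at one grid cell

r = np.corrcoef(x, y)[0, 1]                   # Pearson correlation
reg = r * y.std() / x.std()                   # slope recovered from correlation
slope = np.polyfit(x, y, 1)[0]                # ordinary least-squares slope

# The two agree to floating-point precision
assert np.isclose(reg, slope)
```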
This module equips you to perform both correlation and regression analysis for diagnosing climate teleconnections.
### 📌 Example Usage

Assume you have a standardized Niño 3.4 index (`nino34_ts_std`) and a seasonal anomaly field (`dat_NDJFM`), both on the `year` dimension:
```python
# Set years for predictor (Niño 3.4) and response (e.g., NDJFM SST anomalies)
ystr_index, yend_index = 1991, 2020
ystr_field, yend_field = 1991, 2020

# Correlation using GeoCAT (fast, may warn on missing values)
corr_map1 = lead_lag_correlation_geocat(nino34_ts_std, ystr_index, yend_index,
                                        dat_NDJFM, ystr_field, yend_field)

# Correlation using NumPy (no warnings, slightly slower)
corr_map2 = lead_lag_correlation(nino34_ts_std, ystr_index, yend_index,
                                 dat_NDJFM, ystr_field, yend_field)

# Optional regression map using standard deviations
nino_std = nino34_ts_std.sel(year=slice(ystr_index, yend_index)).std(dim="year")
std_Y = dat_NDJFM.sel(year=slice(ystr_field, yend_field)).std(dim="year")
reg_map = std_Y * corr_map2 / nino_std
```
📝 Make sure your index and anomaly field are both aligned on the `year` coordinate before computing.
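One way to verify this alignment up front is `xr.align` with an inner join (the toy index and field below are illustrative stand-ins for your data):

```python
import numpy as np
import xarray as xr

years = np.arange(1991, 2021)
rng = np.random.default_rng(42)

# Illustrative stand-ins for a standardized index and a seasonal anomaly field
index = xr.DataArray(rng.standard_normal(years.size),
                     coords={"year": years}, dims="year")
field = xr.DataArray(rng.standard_normal((years.size, 2, 3)),
                     coords={"year": years, "lat": [0.0, 1.0], "lon": [0.0, 1.0, 2.0]},
                     dims=("year", "lat", "lon"))

# Inner join keeps only the shared years; matching sizes confirm alignment
index_al, field_al = xr.align(index, field, join="inner")
assert index_al["year"].size == field_al["year"].size == 30
```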
### ▶️ Example: One-Year Lag Correlation
To perform a one-year lagged correlation (e.g., Niño 3.4 from 1952–2020 against NDJFM anomalies from 1951–2019):
```python
corr_m12 = lead_lag_correlation_geocat(
    ts=nino34_ts_std,
    ystr1=1952, yend1=2020,
    dat=dat_NDJFM,
    ystr2=1951, yend2=2019,
    min_coverage=0.9
)
```
In this case, each index year is paired with the NDJFM field from the previous year, so the correlation tests whether the NDJFM field leads the Niño 3.4 index by one year. Ensure that both `ts` and `dat` are aligned on the `'year'` dimension, and that the dummy-index logic inside the function handles this offset correctly.
This example illustrates how to explore climate teleconnections using spatial correlation and regression diagnostics.