Basics

Climate datasets are multi-dimensional, typically with time, height (or depth), latitude, and longitude dimensions. You may want to get the data value at a specific location and time, such as the monthly sea surface temperature anomaly at the equator in January 1998. Or you may just want to know the size of each dimension in the dataset. There are several ways to manipulate such datasets, and it is easy to get confused about which one to use. The main source of this confusion is that we use several Python packages at once. Even though NumPy, Pandas, and Xarray share many similar features, they have different purposes and objectives.

A dataset in NumPy is simply a multi-dimensional array. It doesn’t contain any coordinates. Since NumPy doesn’t connect the data array with its coordinates, you must define and track them yourself. It’s more like mathematics.

Pandas can convert a NumPy array into a labeled dataset such as a time series. It basically attaches one coordinate (the index, typically time) to one or more variables. In other words, a data array in Pandas is labeled along a single dimension (i.e., the time dimension). So we need somewhat different ways to manipulate a Pandas array compared with a NumPy array. Note that a Pandas array has only the time coordinate and NOT the full set of spatiotemporal coordinates (i.e., time, height, latitude, and longitude). Therefore, Pandas alone is insufficient for manipulating climate datasets.

Xarray is the best way to organize climate datasets. Once we define the Xarray dataset appropriately, there are many flexible approaches to manipulate it. The Xarray dataset supports many of the methods used in NumPy and Pandas in addition to its own. Therefore, you need to know which package your data array belongs to: NumPy, Pandas, or Xarray.
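
As a minimal sketch of this difference, the same five values can be stored in each of the three packages. The values and the variable names (arr, ser, da) are made up for illustration only.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Five made-up daily values (illustrative only)
values = np.array([20.0, 20.1, 20.2, 20.3, 20.4])
day = pd.date_range("2018-01-01", periods=5, freq="D")

arr = values                                    # NumPy: values only, no coordinates
ser = pd.Series(values, index=day)              # Pandas: values labeled by date
da = xr.DataArray(values, coords={"time": day}, dims=["time"])  # Xarray: named dimension with a coordinate

# The third value, selected the way each package expects
print(type(arr).__name__, arr[2])
print(type(ser).__name__, ser.iloc[2])
print(type(da).__name__, da.isel(time=2).item())
```

The data are identical in all three cases; what changes is how much labeling travels with the values and which selection syntax applies.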

Data Properties

This section summarizes common attributes. Please see the next section for details of the data structures.

NumPy

Let’s assume that dat is a NumPy array.

  • dat.ndim : Number of dimensions of numpy array

  • dat.shape : Shape of numpy array

  • dat.size : Total number of numpy elements
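
A short sketch of these three attributes, assuming a hypothetical array shaped like (time, lat, lon):

```python
import numpy as np

# Hypothetical array with 12 time steps, 3 latitudes, and 4 longitudes
dat = np.zeros((12, 3, 4))

print(dat.ndim)   # 3
print(dat.shape)  # (12, 3, 4)
print(dat.size)   # 144
```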

Pandas

Let’s assume that sd is a Series in Pandas.

  • sd.ndim : Number of dimensions of Series array (always 1)

  • sd.shape : Shape of Series array

  • sd.size : Total number of Series elements

  • sd.values : Values of Series array

  • sd.index : Index of Series array

  • sd.axes : Axes of Series array

  • sd.name : Name of Series array

Let’s assume that df is a DataFrame in Pandas.

  • df.ndim : Number of dimensions of DataFrame array (always 2)

  • df.shape : Shape of DataFrame array

  • df.size : Total number of DataFrame elements

  • df.values : Values of DataFrame array

  • df.index : Index of DataFrame array

  • df.axes : Axes of DataFrame array

  • df.attrs : Attributes of DataFrame array

  • df.columns : Columns of DataFrame array
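
The sketch below prints several of the Series and DataFrame attributes listed above. The data are placeholders, not real observations.

```python
import numpy as np
import pandas as pd

day = pd.date_range("2018-01-01", periods=5, freq="D")

# Placeholder values: 0.0 .. 4.0 for tmax, zeros for prcp
sd = pd.Series(np.arange(5, dtype=float), index=day, name="tmax")
df = pd.DataFrame({"tmax": sd, "prcp": np.zeros(5)}, index=day)

print(sd.ndim, sd.shape, sd.size, sd.name)  # 1 (5,) 5 tmax
print(df.ndim, df.shape, df.size)           # 2 (5, 2) 10
print(list(df.columns))                     # ['tmax', 'prcp']
print(len(df.axes))                         # 2 (row index and column index)
```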

Xarray

Let’s assume that da is the DataArray.

  • da.ndim : Number of dimensions of DataArray array

  • da.shape : Shape of DataArray array

  • da.size : Total number of DataArray elements

  • da.values : Values of DataArray

  • da.coords : Coordinates of DataArray

  • da.attrs : Attributes of DataArray

  • da.data : Similar to da.values

Let’s assume that ds is the Dataset. Array-style attributes such as ndim, shape, size, and values belong to each variable (e.g., ds['temp'].shape), because the variables in a Dataset may have different dimensions.

  • ds.dims : Dimension names of Dataset

  • ds.sizes : Length of each Dataset dimension

  • ds.coords : Coordinates of Dataset

  • ds.attrs : Attributes of Dataset

  • ds.data_vars : Dictionary of data variables
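
Here is a small sketch of the Xarray attributes, using a made-up 3x4 grid. The variable name "temp" and the coordinate values are assumptions for illustration; for the Dataset, the per-dimension lengths are read through ds.sizes.

```python
import numpy as np
import xarray as xr

lat = np.array([-1.0, 0.0, 1.0])
lon = np.array([0.0, 1.0, 2.0, 3.0])

# Made-up DataArray on a tiny lat-lon grid
da = xr.DataArray(
    np.zeros((3, 4)),
    coords={"lat": lat, "lon": lon},
    dims=["lat", "lon"],
    name="temp",
    attrs={"units": "degC"},
)
ds = xr.Dataset({"temp": da})

print(da.ndim, da.shape, da.size)  # 2 (3, 4) 12
print(da.dims)                     # ('lat', 'lon')
print(da.attrs)                    # {'units': 'degC'}
print(list(ds.data_vars))          # ['temp']
print(dict(ds.sizes))              # {'lat': 3, 'lon': 4}
```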

Data Structures

Pandas

Pandas has two types of data structure: Series and DataFrame. A Series is a one-dimensional array labeled by an index. One example is a time series of daily maximum temperature labeled by date; it is a 1-D array whose length is the number of dates. A DataFrame is a group of Series that share the same index (i.e., the same date coordinate). For example, a DataFrame can contain time series of daily maximum temperature and daily precipitation labeled by date. It is then a 2-D table with one row per date and N columns, where N is the number of variables.

Warning

Index is a confusing concept in Pandas and Xarray because it has several meanings. Sometimes, index sounds like a label, a coordinate, or a reference. In Pandas, rows and columns both have indexes. Let’s assume that you have a table like a CSV file. Generally speaking, the leftmost column would be the index and the top row would be the header.
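
To make the CSV analogy concrete, here is a minimal sketch that reads a tiny in-memory table. The CSV text is invented for this example; index_col=0 tells Pandas to use the leftmost column as the row index.

```python
import io
import pandas as pd

# Invented two-row table; the header row names the columns
csv_text = """date,tmax,prcp
2018-01-01,20.0,0
2018-01-02,20.1,1
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(df.index)    # row labels, taken from the leftmost column
print(df.columns)  # column labels, taken from the header row
```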

Sample program: pandas_str-v1.py.

Series

Let’s make a pandas.Series data array. Here, we assume daily maximum temperature (tmax) and precipitation (prcp) from January 1, 2018, to January 5, 2018. The index of the Series corresponds to the dates in that period. A sample program is shown below.

import numpy as np
import pandas as pd

day = pd.date_range("2018-01-01", periods=5, freq="D")

tmax = pd.Series(np.random.randn(day.shape[0]), index=day)
prcp = pd.Series(np.random.randn(day.shape[0]), index=day)

This sample script makes the index day using the pandas.date_range function. It covers January 1, 2018 to January 5, 2018 at a 1-day interval. The output looks like this:

print(day)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05'],
              dtype='datetime64[ns]', freq='D')

The sample script fills tmax and prcp with random values from NumPy arrays created by the numpy.random.randn function. We can make the Series data array using the pandas.Series function. The output of tmax looks like:

print(tmax)

2018-01-01    1.183303
2018-01-02   -1.972609
2018-01-03    0.707680
2018-01-04    0.427772
2018-01-05   -0.903869
Freq: D, dtype: float64

The prcp variable has the same length as tmax. The output of prcp looks like:

print(prcp)

2018-01-01   -0.016126
2018-01-02    2.151788
2018-01-03    1.134309
2018-01-04    0.015779
2018-01-05   -0.127509
Freq: D, dtype: float64

Note

We generated random values using the NumPy function. So, your output of tmax and prcp will differ from the values shown here.
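
If you want the same random values on every run, one option is a seeded NumPy generator. This is only a sketch and not part of the sample program; the seed value 0 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

day = pd.date_range("2018-01-01", periods=5, freq="D")

# Two generators created with the same (arbitrary) seed produce the same values
tmax = pd.Series(np.random.default_rng(seed=0).standard_normal(day.size), index=day)
tmax_again = pd.Series(np.random.default_rng(seed=0).standard_normal(day.size), index=day)

print(tmax.equals(tmax_again))  # True
```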

DataFrame

Next, let’s create the DataFrame. The DataFrame contains several Series arrays. We can create a new DataFrame array (df) using the pandas.DataFrame function. Here, we show the example of two variables (i.e., “tmax” and “prcp”), but you can add more variables as you like.

df = pd.DataFrame(data={'tmax': tmax, 'prcp': prcp}, index=day)

Here is the output of df:

print(df)
                tmax      prcp
2018-01-01 -0.674303 -0.558260
2018-01-02 -0.157591 -0.511053
2018-01-03  1.403665  0.714594
2018-01-04 -0.405556  1.091943
2018-01-05  0.279516 -0.269472

Xarray

As we have seen above, Pandas has two types of data array. The first type is a single variable with a label, called Series. The second type is a set of Series sharing the same label, called DataFrame. Similarly, Xarray has two types of data array: DataArray and Dataset.

A DataArray consists of data values and coordinates. We can still use an index as in pandas.Series. However, a DataArray usually has several coordinates, and it is easier to organize the data by coordinates than by index. Since climate data are multi-dimensional arrays, we need to be familiar with DataArray. To organize multiple DataArrays, we can use a Dataset, mainly for reading and writing DataArrays.

Sample program: xarray_str-v1.py.

DataArray

Let’s check the script below.

 1  import numpy as np
 2  import pandas as pd
 3  import xarray as xr
 4  import os
 5
 6
 7  # --- Coordinate ---
 8  date = pd.period_range("2000-01-01", "2005-12-31", freq="M")
 9  lat = np.arange(-89, 90, 1, dtype=float)
10  lon = np.arange(0,  360, 1, dtype=float)
11
12  # --- zero and random values ---
13  dzero = np.zeros((date.size,lat.size,lon.size), dtype=float)
14  drand = np.random.rand(date.size,lat.size,lon.size)
15
16  tmax = xr.DataArray(20.+drand, coords=[date,lat,lon], dims=["time","lat","lon"])
17  tmax.name = "tmax"
18  tmax.attrs["units"] = "degC"
19
20  prcp = xr.DataArray(dzero, coords=[date,lat,lon], dims=["time","lat","lon"])
21  prcp.name = "prcp"
22  prcp.attrs["units"] = "mm/day"

First, we define the coordinate values. Here, let’s consider a 3-dimensional variable with time, latitude, and longitude coordinates. The time coordinate comes from the pandas.period_range function, which labels each month in YYYY-MM format (e.g., 2000-01). We named it “date” (Line 8). We also define latitude and longitude as “lat” and “lon” using the numpy.arange function (Lines 9-10). So, the lat-lon grid here has 1°x1° resolution, covering -89 to 89 in latitude and 0 to 359 in longitude.

Next, we make fake values using NumPy functions (Lines 13-14). This part is flexible, and we can skip it if we get the actual values from a data file.

Finally, we define the DataArrays using the xarray.DataArray function, as shown on Lines 16 and 20. The function takes the coordinates and dimension names through the “coords” and “dims” arguments. We can also add “name” and “attrs” to attach more information (Lines 17-18 and 21-22). Here is the output of DataArray “tmax”:

print(tmax)
<xarray.DataArray 'tmax' (time: 72, lat: 179, lon: 360)>
array([[[20.17526909, 20.56449617, 20.39136298, ..., 20.44666016,
       20.63424445, 20.85352716],
      ...,
      [20.24997025, 20.60670297, 20.13117975, ..., 20.14150434,
       20.84208498, 20.98607752]]])
Coordinates:
  * time     (time) object 2000-01 2000-02 2000-03 ... 2005-10 2005-11 2005-12
  * lat      (lat) float64 -89.0 -88.0 -87.0 -86.0 -85.0 ... 86.0 87.0 88.0 89.0
  * lon      (lon) float64 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
Attributes:
    units:    degC

Dataset

When DataArrays share the same coordinates, we can merge them into a Dataset. This is an example of how to define a Dataset.

ds = xr.Dataset(
    {
        'temp': (['time', 'lat', 'lon'], tmax.data),
        'prcp': (['time', 'lat', 'lon'], prcp.data),
    },
    coords={'time': date, 'lat': lat, 'lon': lon},
)

Here is the output of Dataset “ds”:

print(ds)
<xarray.Dataset>
Dimensions:  (time: 72, lat: 179, lon: 360)
Coordinates:
  * time     (time) object 2000-01 2000-02 2000-03 ... 2005-10 2005-11 2005-12
  * lat      (lat) float64 -89.0 -88.0 -87.0 -86.0 -85.0 ... 86.0 87.0 88.0 89.0
  * lon      (lon) float64 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
Data variables:
    temp     (time, lat, lon) float64 20.88 20.94 20.05 ... 20.33 20.86 20.8
    prcp     (time, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
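
As an alternative sketch, a Dataset can also be built directly from DataArray objects, either by passing them in a dictionary or by using xarray.merge. The tiny grid below is a stand-in for the 72 x 179 x 360 arrays above, and the time axis uses pandas.date_range rather than period_range to keep the example short.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small stand-in coordinates (assumed for illustration)
time = pd.date_range("2000-01-01", periods=3, freq="MS")
lat = np.array([-1.0, 0.0, 1.0])
lon = np.array([0.0, 1.0])
coords = {"time": time, "lat": lat, "lon": lon}
dims = ["time", "lat", "lon"]
shape = (time.size, lat.size, lon.size)

tmax = xr.DataArray(np.full(shape, 20.0), coords=coords, dims=dims, name="tmax")
prcp = xr.DataArray(np.zeros(shape), coords=coords, dims=dims, name="prcp")

# Option 1: dictionary of DataArrays; the key becomes the variable name
ds_a = xr.Dataset({"temp": tmax, "prcp": prcp})

# Option 2: merge named DataArrays (tmax is renamed to "temp" first)
ds_b = xr.merge([tmax.rename("temp"), prcp])

print(sorted(ds_a.data_vars))
print(ds_a.equals(ds_b))
```

Both options reuse the coordinates already attached to each DataArray, so they do not need to be passed again.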

Data Indexing

NumPy

Sample script numpy_index-v1.py

A NumPy array uses integer-based positions. Let’s first define two NumPy arrays, x1 and x2, and see how integer-based positions work:

--- x1 = np.arange(0, 20, 1) ---
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
--- x2 = np.arange(0, 40, 2) ---
[ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38]

Integer-based positions run from 0 to N-1, where N is the length of the array. The examples below show how to get the value at a specific position. A negative value counts from the end, so -1 selects the last position.

--- x1[0] ---
0

--- x1[-1] ---
19

--- x1[3] ---
3

--- x1[-3] ---
17

You can also select a range with [start:end:interval]. The start position is included, but the end position is not.

--- x1[5:] ---
[ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]

--- x1[:5] ---
[0 1 2 3 4]

--- x1[::2] ---
[ 0  2  4  6  8 10 12 14 16 18]

--- x1[::-2] ---
[19 17 15 13 11  9  7  5  3  1]

--- x2[2:8:2] ---
[ 4  8 12]
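
The same slicing rules apply to each axis of a multi-dimensional array. Here is a minimal sketch, assuming a hypothetical array ordered as (time, lat, lon):

```python
import numpy as np

# Hypothetical (time, lat, lon) array holding the values 0..23
dat = np.arange(24).reshape(2, 3, 4)

first_time = dat[0]        # first time step, shape (3, 4)
last_lon = dat[:, :, -1]   # last longitude at every time and latitude, shape (2, 3)
sub = dat[:, 0:2, ::2]     # first two latitudes, every other longitude, shape (2, 2, 2)

print(first_time.shape, last_lon.shape, sub.shape)
```

Because NumPy has no coordinates, we have to remember which axis is time, latitude, or longitude ourselves.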

You can also replace values by selecting integer-based positions. Be careful with the end of the range, which is excluded:

--- x1[3] = 0 ---
before
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
after
[ 0  1  2  0  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]

--- x2[3:7] = 0 ---
before
[ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38]
after
[ 0  2  4  0  0  0  0 14 16 18 20 22 24 26 28 30 32 34 36 38]
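
A quick check of the excluded end position, repeating the x2 example above:

```python
import numpy as np

x2 = np.arange(0, 40, 2)
x2[3:7] = 0           # positions 3, 4, 5, and 6 only

print(x2[3:7].size)   # 4
print(x2[7])          # 14 (position 7 is untouched)
```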

Pandas

Pandas can use integer-based positions like a NumPy array. In addition, Pandas has label-based positions. As you can see below, label-based positioning is useful but can be confusing because there are several indexing operators. Python and NumPy provide the indexing operator [] and the attribute operator ., which give quick and easy access to data structures. Combined with the Pandas indexers, they offer several ways to select elements, as described below.

Sample script pandas_index-v1.py

Series

Pandas uses iloc for integer-based positions and loc for label-based positions. There are comparable operators for single values: iat and at. Let’s define a pandas.Series “tmax” and see examples using these operators:

--- tmax: Series ---
2018-01-01    20.0
2018-01-02    20.1
2018-01-03    20.2
2018-01-04    20.3
2018-01-05    20.4
Freq: D, dtype: float64

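Integer-based positioning (each expression on the left returns the value on the right; recent Pandas versions deprecate plain tmax[2] on a date-labeled Series, so prefer iloc or iat):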

tmax[2] = 20.2
tmax.iloc[2] = 20.2
tmax.iat[2] = 20.2

Label-based positioning:

tmax['2018-01-03'] = 20.2
tmax.loc['2018-01-03'] = 20.2
tmax.at['2018-01-03'] = 20.2
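
The following sketch rebuilds the “tmax” Series above and applies the indexers. The last line is an addition to the examples in the text: with loc, a label range includes both end points, unlike an integer-based slice.

```python
import pandas as pd

day = pd.date_range("2018-01-01", periods=5, freq="D")
tmax = pd.Series([20.0, 20.1, 20.2, 20.3, 20.4], index=day)

# Integer-based position
print(tmax.iloc[2], tmax.iat[2])

# Label-based position (a Timestamp is used with .at to be explicit)
print(tmax.loc["2018-01-03"], tmax.at[pd.Timestamp("2018-01-03")])

# Label-based range: both "2018-01-02" and "2018-01-04" are included
print(tmax.loc["2018-01-02":"2018-01-04"])
```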

DataFrame

Let’s define a pandas.DataFrame “df” with index=day like this:

--- df: DataFrame ---
            tmax  prcp
2018-01-01  20.0     0
2018-01-02  20.1     1
2018-01-03  20.2     2
2018-01-04  20.3     3
2018-01-05  20.4     4

We can use the [] or . operator to extract the Series “tmax” from the DataFrame “df”. Referring to df['tmax'] and df.tmax gives the same result.

-- df['tmax'] or df.tmax --
2018-01-01    20.0
2018-01-02    20.1
2018-01-03    20.2
2018-01-04    20.3
2018-01-05    20.4
Freq: D, Name: tmax, dtype: float64

Once we select the Series “tmax” from the DataFrame “df”, we can use both integer-based and label-based positions, as we have seen in the Series above.

df.tmax[2] = 20.2
df.tmax.iloc[2] = 20.2
df.tmax['2018-01-03'] = 20.2
df.tmax.loc['2018-01-03'] = 20.2

In contrast to the Series, the DataFrame has to use the iloc or loc indexer to select rows. Without an indexer, your script will raise an error. Here, the DataFrame df has the index, tmax, and prcp columns shown below, with the integer position of each row on the left.

0  2018-01-01  20.0     0
1  2018-01-02  20.1     1
2  2018-01-03  20.2     2
3  2018-01-04  20.3     3
4  2018-01-05  20.4     4

To extract the third row from the DataFrame df, we need to use an indexer such as df.iloc[2] or df.loc['2018-01-03']. The output looks like this:

 -- df.iloc[2] --
tmax    20.2
prcp     2.0
Name: 2018-01-03 00:00:00, dtype: float64

 -- df.loc['2018-01-03'] --
tmax    20.2
prcp     2.0
Name: 2018-01-03 00:00:00, dtype: float64

Warning

df[2] and df['2018-01-03'] don’t work because df is a DataFrame, where [] looks up column labels rather than rows. Those expressions are applicable to pandas.Series but not to pandas.DataFrame.

In summary, the indexers iloc and loc can extract a row from a Series or DataFrame.
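
As a sketch that goes one step beyond the row examples above, iloc and loc also accept a row and a column together, which returns a single value. The DataFrame is rebuilt here so the example runs on its own.

```python
import pandas as pd

day = pd.date_range("2018-01-01", periods=5, freq="D")
df = pd.DataFrame(
    {"tmax": [20.0, 20.1, 20.2, 20.3, 20.4], "prcp": [0, 1, 2, 3, 4]},
    index=day,
)

row = df.loc["2018-01-03"]                     # whole row, returned as a Series
value_by_label = df.loc["2018-01-03", "tmax"]  # row label and column label
value_by_pos = df.iloc[2, 0]                   # row position and column position

print(row["prcp"], value_by_label, value_by_pos)
```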

Xarray

In Pandas, we use the indexers iloc and loc to extract data from a Series or DataFrame. Xarray also supports loc on a DataArray whose coordinates are indexed. However, Xarray prefers to use named coordinates instead of a bare index to manipulate multi-dimensional data arrays. As a result, we use isel for integer-based positions and sel for label-based (coordinate-based) positions in DataArray and Dataset.

DataArray

Let’s assume that “tmax” is the DataArray defined in the Data Structures section above.

To extract data at an integer-based position, we apply isel to the DataArray. Since a DataArray has several coordinates, we need to name the one we mean explicitly. To get the first position along the time coordinate, we write .isel(time=0). Here is the output of tmax.isel(time=0):

print(tmax.isel(time=0))
<xarray.DataArray 'tmax' (lat: 179, lon: 360)>
array([[20.8884473 , 20.26124451, 20.87749136, ..., 20.03960973,
      20.34806012, 20.68857054],
     ...,
     [20.33801077, 20.82038127, 20.25744721, ..., 20.40038417,
      20.61148223, 20.30406515]])
Coordinates:
    time     object 2000-01
  * lat      (lat) float64 -89.0 -88.0 -87.0 -86.0 -85.0 ... 86.0 87.0 88.0 89.0
  * lon      (lon) float64 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
Attributes:
    units:    degC

Similarly, we can get data at a label-based position using sel, again naming the coordinate. Here is an example:

print(tmax.sel(time='2004-05'))
<xarray.DataArray 'tmax' (lat: 179, lon: 360)>
array([[20.68326577, 20.43074489, 20.03587269, ..., 20.37454377,
      20.47737278, 20.05191877],
     ...,
     [20.83891633, 20.96515379, 20.35485984, ..., 20.7117678 ,
      20.60424443, 20.71035624]])
Coordinates:
    time     object 2004-05
  * lat      (lat) float64 -89.0 -88.0 -87.0 -86.0 -85.0 ... 86.0 87.0 88.0 89.0
  * lon      (lon) float64 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
Attributes:
    units:    degC

Warning

Be careful when selecting latitude. .isel(lat=0) and .sel(lat=0) give totally different results. The former picks the first element (typically near 90S or 90N; -89 in this example), whereas the latter chooses the equator (0 degrees).
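
The difference can be checked with a small sketch in which the data values simply equal the latitude, so each selection shows which latitude it picked. The last line uses the method="nearest" option of sel, which is not covered in the text above; it is included only as a hint for latitudes that are not exactly on the grid.

```python
import numpy as np
import xarray as xr

lat = np.arange(-89, 90, 1, dtype=float)

# Values equal the latitude itself, purely for easy checking
da = xr.DataArray(lat.copy(), coords={"lat": lat}, dims=["lat"])

print(da.isel(lat=0).item())                     # -89.0 (first element)
print(da.sel(lat=0).item())                      # 0.0 (the equator)
print(da.sel(lat=0.4, method="nearest").item())  # 0.0 (closest grid latitude)
```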

Dataset

Similar to DataArray, we can use the indexers .isel and .sel on a Dataset. Here are some examples.

  • ds.temp.sel(time='2003-05')

  • ds.temp.isel(lon=10)

  • ds.sel(lat=-30)
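
A runnable sketch of these selections on a small made-up Dataset follows. The coarse 30-degree grid and the datetime-based time axis are assumptions chosen to keep the example short, so the time label is written as a full date here.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up coarse grid: 12 months, 3 latitudes, 12 longitudes
time = pd.date_range("2003-01-01", periods=12, freq="MS")
lat = np.array([-30.0, 0.0, 30.0])
lon = np.arange(0.0, 360.0, 30.0)
shape = (time.size, lat.size, lon.size)

ds = xr.Dataset(
    {
        "temp": (["time", "lat", "lon"], np.full(shape, 20.0)),
        "prcp": (["time", "lat", "lon"], np.zeros(shape)),
    },
    coords={"time": time, "lat": lat, "lon": lon},
)

may = ds.temp.sel(time="2003-05-01")  # label-based selection along time
col = ds.temp.isel(lon=10)            # integer-based selection along lon
south = ds.sel(lat=-30)               # applies to every variable in the Dataset

print(col.shape, float(col.lon))
print(south.temp.shape, south.prcp.shape)
```

Selecting on the Dataset itself, as in the last case, keeps all variables aligned on the remaining coordinates.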

Note

The .loc indexer takes square brackets, not parentheses, so ds.temp.loc(lat=0) raises an error. To use dimension names with .loc, pass a dictionary, as in ds.temp.loc[dict(lat=0)]; in most cases ds.temp.sel(lat=0) is easier to read.