Basics¶
Climate datasets are multi-dimensional, typically with time, height (or depth), latitude, and longitude dimensions. You may want the data value at a specific location and time, such as the monthly sea surface temperature anomaly at the equator in January 1998. Or you may just want to know the size of each dimension. There are several ways to manipulate such datasets, and it is easy to get confused about which one to use. The main source of confusion is that we use several Python packages at once. NumPy, Pandas, and Xarray share many similar features, but they have different purposes and objectives.
A dataset in NumPy is simply a multi-dimensional array. It doesn’t carry any coordinates. Since NumPy doesn’t connect the data array with its coordinates, you must keep track of them yourself. It’s more like plain mathematics.
Pandas can turn a NumPy array into a labeled dataset such as a time series. A Pandas object is typically labeled by a single index (here, time) that is shared by one or more variables. In other words, it is labeled along one dimension, so we need somewhat different ways to manipulate a Pandas object than a NumPy array. Note that in this usage the Pandas object carries only the time coordinate, NOT the full set of spatiotemporal coordinates (i.e., time, height, latitude, longitude). Therefore, Pandas alone is awkward for manipulating gridded climate datasets.
Xarray is well suited to organizing climate datasets. Once we define an Xarray dataset appropriately, there are many flexible ways to manipulate it. Xarray objects offer many of the methods found in NumPy and Pandas as well as their own. Therefore, always make sure you know which package your data array belongs to: NumPy, Pandas, or Xarray.
Data Properties¶
This section summarizes common attributes. The next section describes the data structures in detail.
NumPy¶
Let’s assume that dat is a NumPy array.
dat.ndim : Number of dimensions of the array
dat.shape : Shape of the array (length along each dimension)
dat.size : Total number of elements in the array
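As a quick check of the three attributes listed above, here is a minimal sketch. The array shape (12, 4, 8), standing for time, latitude, and longitude, is an arbitrary assumption made for illustration.

```python
import numpy as np

# Hypothetical 3-D array laid out as (time, lat, lon); the sizes are arbitrary
dat = np.zeros((12, 4, 8))

print(dat.ndim)   # number of dimensions
print(dat.shape)  # length along each dimension
print(dat.size)   # total number of elements (product of the shape)
```

Note that nothing in dat records which axis is time or latitude; that bookkeeping is left to you.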
Pandas¶
Let’s assume that sd is a Series in Pandas.
sd.ndim : Number of dimensions of Series (always 1)
sd.shape : Shape of Series
sd.size : Total number of Series elements
sd.values : Values of Series (as a NumPy array)
sd.index : Index of Series
sd.axes : Axes of Series (a list holding the index)
sd.name : Name of Series
Let’s assume that df is a DataFrame in Pandas.
df.ndim : Number of dimensions of DataFrame (always 2)
df.shape : Shape of DataFrame (rows, columns)
df.size : Total number of DataFrame elements
df.values : Values of DataFrame (as a NumPy array)
df.index : Index (row labels) of DataFrame
df.axes : Axes of DataFrame (a list holding the index and the columns)
df.attrs : Attributes (metadata dictionary) of DataFrame
df.columns : Columns of DataFrame
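The following sketch prints a few of these attributes for a small Series and DataFrame. The variable names and the five-day date range are assumptions chosen to match the examples later on this page.

```python
import numpy as np
import pandas as pd

# Five daily time stamps used as the shared index
day = pd.date_range("2018-01-01", periods=5, freq="D")

# A named Series and a two-column DataFrame built on the same index
sd = pd.Series(np.arange(5, dtype=float), index=day, name="tmax")
df = pd.DataFrame({"tmax": sd, "prcp": np.zeros(5)}, index=day)

print(sd.ndim, sd.shape, sd.size)   # a Series is 1-D
print(df.ndim, df.shape, df.size)   # a DataFrame is 2-D: (rows, columns)
print(sd.name, list(df.columns))
```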
Xarray¶
Let’s assume that da is a DataArray.
da.ndim : Number of dimensions of DataArray
da.shape : Shape of DataArray
da.size : Total number of DataArray elements
da.values : Values of DataArray (as a NumPy array)
da.coords : Coordinates of DataArray
da.attrs : Attributes of DataArray
da.data : The underlying array (similar to da.values)
Let’s assume that ds is a Dataset.
ds.dims : Dimension names of Dataset
ds.sizes : Mapping from each dimension name to its length
ds.coords : Coordinates of Dataset
ds.attrs : Attributes of Dataset
ds.data_vars : Dictionary of data variables
A Dataset groups several variables, so ndim, shape, size, and values belong to each data variable rather than to the Dataset itself. Apply them to a variable selected from ds.data_vars.
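Here is a minimal sketch of these attributes on a tiny DataArray and a Dataset that wraps it. The 5 x 4 grid and the variable name "temp" are assumptions made only for illustration.

```python
import numpy as np
import xarray as xr

# Small hypothetical grid: 5 latitudes by 4 longitudes
lat = np.arange(-2.0, 3.0)
lon = np.arange(0.0, 4.0)

da = xr.DataArray(
    np.zeros((lat.size, lon.size)),
    coords={"lat": lat, "lon": lon},
    dims=["lat", "lon"],
    name="temp",
)
ds = xr.Dataset({"temp": da})

print(da.ndim, da.shape, da.size)  # array-like attributes on the DataArray
print(dict(ds.sizes))              # dimension name -> length on the Dataset
print(list(ds.data_vars))          # names of the data variables
print(ds["temp"].shape)            # shape is asked of one variable at a time
```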
Data Structures¶
Pandas¶
Pandas has two types of data structure: Series and DataFrame. A Series is a one-dimensional array labeled by an index. One example is a time series of daily maximum temperature labeled by date. So, it is a 1-D array whose length is the number of dates. A DataFrame is a group of Series. Each Series in a DataFrame shares the same index (i.e., the same date coordinate). For example, a DataFrame can hold time series of daily maximum temperature and daily precipitation labeled by date. So, it is a 2-D table with one row per date and N columns, where N is the number of variables.
Warning
Index is a confusing concept in Pandas and Xarray because it has several meanings. Depending on context, an index can act as a label, a coordinate, or a reference. In Pandas, both rows and columns have indexes. Suppose you have a table such as a CSV file. Generally speaking, the leftmost column is the index and the top row is the header.
Sample program: pandas_str-v1.py
Series¶
Let’s create a pandas.Series. Here, we assume daily maximum temperature (tmax) and precipitation (prcp) from January 1, 2018, to January 5, 2018. The index of each Series is the date over that period. The sample program looks like this:
import numpy as np
import pandas as pd

day = pd.date_range("2018-01-01", periods=5, freq="D")

tmax = pd.Series(np.random.randn(day.shape[0]), index=day)
prcp = pd.Series(np.random.randn(day.shape[0]), index=day)
This sample script creates the index day using the pandas.date_range function. It runs from January 1, 2018 to January 5, 2018 at a 1-day interval. The output looks like this:
print(day)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05'],
dtype='datetime64[ns]', freq='D')
The sample script fills tmax and prcp with random values from NumPy arrays generated by the numpy.random.randn function. We create each Series using the pandas.Series function. The output of tmax looks like:
print(tmax)
2018-01-01 1.183303
2018-01-02 -1.972609
2018-01-03 0.707680
2018-01-04 0.427772
2018-01-05 -0.903869
Freq: D, dtype: float64
The prcp variable has the same length as tmax. The output of prcp looks like:
print(prcp)
2018-01-01 -0.016126
2018-01-02 2.151788
2018-01-03 1.134309
2018-01-04 0.015779
2018-01-05 -0.127509
Freq: D, dtype: float64
Note
We generated the values with a NumPy random function, so your output of tmax and prcp will differ from what is shown here.
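If you want the same values on every run, one option is to draw them from a seeded generator. This is a sketch under that assumption, not part of the sample program; the seed value 0 is arbitrary, and numpy.random.default_rng is used here in place of numpy.random.randn.

```python
import numpy as np
import pandas as pd

day = pd.date_range("2018-01-01", periods=5, freq="D")

# A seeded generator returns the same sequence each time the script runs
rng = np.random.default_rng(seed=0)
tmax = pd.Series(rng.standard_normal(day.size), index=day)
prcp = pd.Series(rng.standard_normal(day.size), index=day)

print(tmax)
```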
DataFrame¶
Next, let’s create the DataFrame. The DataFrame contains several Series arrays. We can create a new DataFrame array (df) using the pandas.DataFrame function. Here, we show the example of two variables (i.e., “tmax” and “prcp”), but you can add more variables as you like.
df = pd.DataFrame(data={'tmax': tmax, 'prcp': prcp}, index=day)
Here is the output of df (the values come from a separate run, so they differ from the Series printed above):
print(df)
tmax prcp
2018-01-01 -0.674303 -0.558260
2018-01-02 -0.157591 -0.511053
2018-01-03 1.403665 0.714594
2018-01-04 -0.405556 1.091943
2018-01-05 0.279516 -0.269472
Xarray¶
As we have seen above, Pandas has two types of data array. The first is a single variable with a label, called a Series. The second is a set of several variables (a collection of Series) sharing the same label, called a DataFrame. Similarly, Xarray has two types of data array: DataArray and Dataset.
A DataArray consists of data values and coordinates. We can still use an index as in pandas.Series. However, a DataArray usually has several coordinates, and it is easier to organize the data by coordinate than by a single index. Since climate data are multi-dimensional arrays, we need to be familiar with DataArray. To organize multiple DataArrays, we can use a Dataset, mainly for reading and writing DataArrays.
Sample program: xarray_str-v1.py
DataArray¶
Let’s check the script below.
1  import numpy as np
2  import pandas as pd
3  import xarray as xr
4  import os
5
6
7  # --- Coordinate ---
8  date = pd.period_range("2000-01-01", "2005-12-31", freq="M")
9  lat = np.arange(-89, 90, 1, dtype=float)
10 lon = np.arange(0, 360, 1, dtype=float)
11
12 # --- zero and random values --
13 dzero = np.zeros((date.size,lat.size,lon.size), dtype=float)
14 drand = np.random.rand(date.size,lat.size,lon.size)
15
16 tmax = xr.DataArray(20.+drand, coords=[date,lat,lon], dims=["time","lat","lon"])
17 tmax.name = "tmax"
18 tmax.attrs["units"] = "degC"
19
20 prcp = xr.DataArray(dzero, coords=[date,lat,lon], dims=["time","lat","lon"])
21 prcp.name = "prcp"
22 prcp.attrs["units"] = "mm/day"
23
First, we define the coordinate values. Here, let’s consider a 3-dimensional variable with time, latitude, and longitude coordinates. The time coordinate comes from the pandas.period_range function, which labels each month in YYYY-MM format (e.g., 2000-01). We named it “date” (Line 8). We also define latitude and longitude as “lat” and “lon” using the numpy.arange function (Lines 9-10). So, the lat-lon grid here has a 1-degree by 1-degree resolution, covering -89 to 89 in latitude and 0 to 359 in longitude.
Next, we make fake values using NumPy functions (Lines 13-14). This part is flexible, and we can skip it if we get the actual values from a data file.
Finally, we define each DataArray using the xarray.DataArray function, as shown on Lines 16 and 20. The function takes the coordinates and dimension names through its “coords” and “dims” arguments. We can also set “name” and “attrs” to add more information, as on Lines 17-18 and 21-22. Here is the output of the DataArray “tmax”:
print(tmax)
<xarray.DataArray 'tmax' (time: 72, lat: 179, lon: 360)>
array([[[20.17526909, 20.56449617, 20.39136298, ..., 20.44666016,
20.63424445, 20.85352716],
...,
[20.24997025, 20.60670297, 20.13117975, ..., 20.14150434,
20.84208498, 20.98607752]]])
Coordinates:
* time (time) object 2000-01 2000-02 2000-03 ... 2005-10 2005-11 2005-12
* lat (lat) float64 -89.0 -88.0 -87.0 -86.0 -85.0 ... 86.0 87.0 88.0 89.0
* lon (lon) float64 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
Attributes:
units: degC
Dataset¶
When DataArrays share the same coordinates, we can merge them into a Dataset. This is an example of how to define a Dataset.
ds = xr.Dataset({
    'temp': (['time','lat','lon'], tmax.data),
    'prcp': (['time','lat','lon'], prcp.data)
    },
    coords={'time': date, 'lat': lat, 'lon': lon})
Here is the output of the Dataset “ds”:
print(ds)
<xarray.Dataset>
Dimensions: (time: 72, lat: 179, lon: 360)
Coordinates:
* time (time) object 2000-01 2000-02 2000-03 ... 2005-10 2005-11 2005-12
* lat (lat) float64 -89.0 -88.0 -87.0 -86.0 -85.0 ... 86.0 87.0 88.0 89.0
* lon (lon) float64 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
Data variables:
temp (time, lat, lon) float64 20.88 20.94 20.05 ... 20.33 20.86 20.8
prcp (time, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
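As an alternative sketch, a Dataset can also be built by passing the DataArrays themselves, in which case their coordinates and attributes are carried over. The small grid and the daily pandas.date_range time axis below are assumptions made to keep the example short; they are not the grid used in the sample program.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small hypothetical coordinates (3 days, 3 latitudes, 2 longitudes)
date = pd.date_range("2000-01-01", periods=3, freq="D")
lat = np.array([-1.0, 0.0, 1.0])
lon = np.array([0.0, 1.0])
shape = (date.size, lat.size, lon.size)

tmax = xr.DataArray(np.full(shape, 20.0), coords=[date, lat, lon],
                    dims=["time", "lat", "lon"], name="tmax")
tmax.attrs["units"] = "degC"
prcp = xr.DataArray(np.zeros(shape), coords=[date, lat, lon],
                    dims=["time", "lat", "lon"], name="prcp")
prcp.attrs["units"] = "mm/day"

# The dictionary key becomes the variable name inside the Dataset
ds = xr.Dataset({"temp": tmax, "prcp": prcp})
print(ds)
```

Passing DataArrays avoids repeating the dimension names and the coords dictionary, at the cost of requiring that the arrays already agree on their coordinates.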
Data Indexing¶
Numpy¶
Sample script numpy_index-v1.py
NumPy arrays use integer-based positions. Let’s first define two NumPy arrays, x1 and x2, and see how integer-based positioning works:
--- x1 = np.arange(0, 20, 1) ---
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
--- x2 = np.arange(0, 40, 2) ---
[ 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38]
Integer-based positions run from 0 to N-1, where N is the length of the array. The examples below show how to get the value at a specific position. A negative value counts from the end, so -1 selects the last position.
--- x1[0] ---
0
--- x1[-1] ---
19
--- x1[3] ---
3
--- x1[-3] ---
17
You can also select a range with [start:end:step]. The start position is included and the end position is excluded.
--- x1[5:] ---
[ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
--- x1[:5] ---
[0 1 2 3 4]
--- x1[::2] ---
[ 0 2 4 6 8 10 12 14 16 18]
--- x1[::-2] ---
[19 17 15 13 11 9 7 5 3 1]
--- x2[2:8:2] ---
[ 4 8 12]
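The selections above can be reproduced with the short script below. It is a sketch that rebuilds x1 and x2 as defined earlier; the comments describe what each expression selects.

```python
import numpy as np

x1 = np.arange(0, 20, 1)
x2 = np.arange(0, 40, 2)

print(x1[0], x1[-1])   # first and last elements
print(x1[-3])          # third element counting from the end
print(x1[5:])          # position 5 to the end
print(x1[:5])          # positions 0 to 4 (position 5 is excluded)
print(x1[::2])         # every second element
print(x1[::-2])        # every second element, walking backwards
print(x2[2:8:2])       # positions 2, 4, 6
```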
You can also replace values by selecting their integer-based positions. Be careful with the end of the range, which is not included:
--- x1[3] = 0 ---
before
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
after
[ 0 1 2 0 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
--- x2[3:7] = 0 ---
before
[ 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38]
after
[ 0 2 4 0 0 0 0 14 16 18 20 22 24 26 28 30 32 34 36 38]
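The following sketch repeats the two assignments and makes the excluded end point explicit: the slice 3:7 touches positions 3, 4, 5, and 6, and leaves position 7 unchanged.

```python
import numpy as np

x1 = np.arange(0, 20, 1)
x2 = np.arange(0, 40, 2)

x1[3] = 0      # replace a single element in place
x2[3:7] = 0    # replace positions 3..6; position 7 keeps its value

print(x1)
print(x2)
print(x2[7])   # still the original value at position 7
```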
Pandas¶
Pandas can use integer-based positions like a NumPy array.
Pandas also has label-based positions.
As you can see below, label-based positioning is useful but can be confusing because there are several indexing operators.
Python and NumPy provide the indexing operator [] and the attribute operator . for quick and easy access to data structures.
Combined with Pandas’ own indexers, these give several ways of selecting elements, as described below.
Sample script pandas_index-v1.py
Series¶
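Pandas uses iloc for integer-based positions and loc for label-based positions.
There are comparable operators for accessing a single value: iat and at.
Let’s define the pandas.Series “tmax” and see examples using these operators below. Note that plain tmax[2] on a date-indexed Series treats the integer as a position only in older Pandas versions; recent releases deprecate this, so prefer iloc or iat: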
--- tmax: Series ---
2018-01-01 20.0
2018-01-02 20.1
2018-01-03 20.2
2018-01-04 20.3
2018-01-05 20.4
Freq: D, dtype: float64
Integer-based positioning:
tmax[2] = 20.2
tmax.iloc[2] = 20.2
tmax.iat[2] = 20.2
Label-based positioning:
tmax['2018-01-03'] = 20.2
tmax.loc['2018-01-03'] = 20.2
tmax.at['2018-01-03'] = 20.2
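Here is a minimal runnable sketch of the four indexers on the same Series. The values 20.0 to 20.4 are rebuilt with NumPy, and a pandas.Timestamp is passed to at as a conservative choice for the single-value lookup.

```python
import numpy as np
import pandas as pd

day = pd.date_range("2018-01-01", periods=5, freq="D")
tmax = pd.Series(20.0 + 0.1 * np.arange(5), index=day)

# Integer-based positioning
by_iloc = tmax.iloc[2]
by_iat = tmax.iat[2]

# Label-based positioning
by_loc = tmax.loc["2018-01-03"]
by_at = tmax.at[pd.Timestamp("2018-01-03")]

print(by_iloc, by_iat, by_loc, by_at)  # all four refer to the same element
```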
DataFrame¶
Let’s define the pandas.DataFrame “df” with index=day like this:
--- df: DataFrame ---
tmax prcp
2018-01-01 20.0 0
2018-01-02 20.1 1
2018-01-03 20.2 2
2018-01-04 20.3 3
2018-01-05 20.4 4
We can use the [] or . operators to extract the Series “tmax” from the DataFrame “df”.
Referring to df['tmax'] and df.tmax gives the same result.
-- df['tmax'] or df.tmax --
2018-01-01 20.0
2018-01-02 20.1
2018-01-03 20.2
2018-01-04 20.3
2018-01-05 20.4
Freq: D, Name: tmax, dtype: float64
Once we have selected the Series “tmax” from the DataFrame “df”, we can use both integer-based and label-based positions, as we have seen for the Series above.
df.tmax[2] = 20.2
df.tmax.iloc[2] = 20.2
df.tmax['2018-01-03'] = 20.2
df.tmax.loc['2018-01-03'] = 20.2
In contrast to the Series, a DataFrame needs the iloc or loc indexers to select rows.
Without them, your script will raise an error.
Here, the DataFrame df contains the index, tmax, and prcp in each row, as shown below. The leading number is the integer position of the row.
0  2018-01-01  20.0  0
1  2018-01-02  20.1  1
2  2018-01-03  20.2  2
3  2018-01-04  20.3  3
4  2018-01-05  20.4  4
To extract the third row from the DataFrame df, we need to use an indexer, as in df.iloc[2] or df.loc['2018-01-03'].
The output looks like this:
-- df.iloc[2] --
tmax 20.2
prcp 2.0
Name: 2018-01-03 00:00:00, dtype: float64
-- df.loc['2018-01-03'] --
tmax 20.2
prcp 2.0
Name: 2018-01-03 00:00:00, dtype: float64
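The row selections above can be reproduced with this sketch, which rebuilds df from the values shown in the table. The last line adds a column label to loc to pick a single cell, which is an extra illustration not shown in the output above.

```python
import numpy as np
import pandas as pd

day = pd.date_range("2018-01-01", periods=5, freq="D")
df = pd.DataFrame(
    {"tmax": 20.0 + 0.1 * np.arange(5), "prcp": np.arange(5)},
    index=day,
)

row_by_position = df.iloc[2]            # third row by integer position
row_by_label = df.loc["2018-01-03"]     # same row by its date label
cell = df.loc["2018-01-03", "tmax"]     # one cell: row label, column label

print(row_by_position)
print(cell)
```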
Warning
df[2] or df['2018-01-03'] don’t work because df is a DataFrame, where [] looks up column labels rather than rows.
Those forms are applicable to pandas.Series but not to pandas.DataFrame.
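A small sketch of what goes wrong: with columns named "tmax" and "prcp", the key 2 is not a column label, so the lookup fails. The variable outcome is a hypothetical name used only to record what happened.

```python
import numpy as np
import pandas as pd

day = pd.date_range("2018-01-01", periods=5, freq="D")
df = pd.DataFrame(
    {"tmax": 20.0 + 0.1 * np.arange(5), "prcp": np.arange(5)},
    index=day,
)

# [] on a DataFrame looks for a column called 2, which does not exist
try:
    df[2]
    outcome = "no error"
except KeyError:
    outcome = "KeyError"

print(outcome)
print(df["tmax"].iloc[2])  # column first, then position within that column
```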
In summary, the indexers iloc and loc can extract a row from a Series or a DataFrame.
Xarray¶
In Pandas, we use the indexers iloc and loc to extract data from a Series or DataFrame.
Xarray also offers loc on a DataArray (with square brackets), although it has no iloc.
However, Xarray prefers coordinates to a single index for manipulating multi-dimensional data arrays.
As a result, we use isel or sel to identify a coordinate-based position in a DataArray or Dataset.
DataArray¶
Let’s assume that “tmax” is the DataArray defined in the DataArray section above.
To extract data at an integer-based position, we apply isel to the DataArray.
Since a DataArray has several coordinates, we need to name the one we mean explicitly.
To get the first position along the time coordinate, we write .isel(time=0).
Here is the output of tmax.isel(time=0):
print(tmax.isel(time=0))
<xarray.DataArray 'tmax' (lat: 179, lon: 360)>
array([[20.8884473 , 20.26124451, 20.87749136, ..., 20.03960973,
20.34806012, 20.68857054],
...,
[20.33801077, 20.82038127, 20.25744721, ..., 20.40038417,
20.61148223, 20.30406515]])
Coordinates:
time object 2000-01
* lat (lat) float64 -89.0 -88.0 -87.0 -86.0 -85.0 ... 86.0 87.0 88.0 89.0
* lon (lon) float64 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
Attributes:
units: degC
Similarly, we can get the data at a label-based position using sel and naming the coordinate.
Here is an example:
print(tmax.sel(time='2004-05'))
<xarray.DataArray 'tmax' (lat: 179, lon: 360)>
array([[20.68326577, 20.43074489, 20.03587269, ..., 20.37454377,
20.47737278, 20.05191877],
...,
[20.83891633, 20.96515379, 20.35485984, ..., 20.7117678 ,
20.60424443, 20.71035624]])
Coordinates:
time object 2004-05
* lat (lat) float64 -89.0 -88.0 -87.0 -86.0 -85.0 ... 86.0 87.0 88.0 89.0
* lon (lon) float64 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
Attributes:
units: degC
Warning
Be careful when selecting latitude. You will get totally different results from .isel(lat=0) and .sel(lat=0). The former picks the first element (typically 90S or 90N; here 89S), whereas the latter chooses the equator (0 degrees).
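To make the difference concrete, here is a sketch with a one-dimensional DataArray whose values equal their own latitude. That construction is an assumption chosen so the selected value shows directly which grid point was picked.

```python
import numpy as np
import xarray as xr

# Same latitude axis as the sample program: -89, -88, ..., 89
lat = np.arange(-89, 90, 1, dtype=float)

# Each value equals its latitude, so the result reveals the selected point
da = xr.DataArray(lat.copy(), coords={"lat": lat}, dims=["lat"])

first = da.isel(lat=0)     # integer position 0: the first grid point
equator = da.sel(lat=0.0)  # label 0.0: the grid point at the equator

print(float(first), float(equator))
```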
Dataset¶
Similar to a DataArray, we can use .isel and .sel on a Dataset.
Here are some examples.
ds.temp.sel(time='2003-05')
ds.temp.isel(lon=10)
ds.sel(lat=-30)
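The sketch below applies isel and sel to a small stand-in Dataset. The three-day time axis and the all-zero data are assumptions; the slice and method="nearest" selections are additional illustrations of sel beyond the three calls listed above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in Dataset on the same 1-degree grid, with a short daily time axis
time = pd.date_range("2000-01-01", periods=3, freq="D")
lat = np.arange(-89, 90, 1, dtype=float)
lon = np.arange(0, 360, 1, dtype=float)
ds = xr.Dataset(
    {"temp": (("time", "lat", "lon"), np.zeros((time.size, lat.size, lon.size)))},
    coords={"time": time, "lat": lat, "lon": lon},
)

one_lon = ds.temp.isel(lon=10)                    # 11th longitude by position
one_lat = ds.sel(lat=-30.0)                       # every variable at 30S
tropics = ds.sel(lat=slice(-10, 10))              # label slices include both ends
nearest = ds.sel(lon=10.4, method="nearest")      # closest grid longitude

print(one_lon.shape, one_lat["temp"].shape)
print(tropics.sizes["lat"], nearest["lon"].item())
```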
Note
The indexer .loc takes square brackets, not function-call parentheses.
So you will get an error message if you write ds.temp.loc(lat=0).
Use ds.temp.loc[dict(lat=0)] or, more simply, ds.temp.sel(lat=0).
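As a sketch of this point, the code below compares the bracket form of .loc with .sel on a small DataArray and records what happens when .loc is called with parentheses. The tiny latitude axis and the variable name outcome are assumptions made for illustration.

```python
import numpy as np
import xarray as xr

lat = np.array([-1.0, 0.0, 1.0])
lon = np.array([0.0, 1.0])
da = xr.DataArray(np.arange(6.0).reshape(3, 2),
                  coords={"lat": lat, "lon": lon}, dims=["lat", "lon"])

with_loc = da.loc[dict(lat=0.0)]   # square brackets with a dictionary
with_sel = da.sel(lat=0.0)         # the equivalent sel call

# Calling .loc like a function is expected to fail
try:
    da.loc(lat=0.0)
    outcome = "no error"
except TypeError:
    outcome = "TypeError"

print(with_loc.values, with_sel.values, outcome)
```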