Basics
======

Climate datasets are multi-dimensional, typically with time, height (or depth), latitude, and longitude dimensions. You may want to get the data value at a specific location and time, such as monthly sea surface temperature anomalies at the equator during January 1998. Or you may just want to know the sizes of the dataset's dimensions.

There are several ways to manipulate the datasets, and it is easy to get confused about which one to use. The main source of this confusion is that we use several Python packages. Even though NumPy, Pandas, and Xarray share many similar features, the packages have different purposes and objectives.

A dataset in NumPy is simply a multi-dimensional array. It doesn't contain any coordinates. Since NumPy doesn't connect the data array with its coordinates, you must keep track of them yourself. It's more like mathematics.

Pandas can convert a NumPy array into a time series dataset. It basically contains a time coordinate for multiple variables. In other words, a data array in Pandas is labeled along a single dimension (here, the time dimension). So we need somewhat different ways to manipulate a Pandas array compared with a NumPy array. Note that the Pandas array has only the time coordinate, **NOT** the spatiotemporal multi-coordinates (i.e., time, height, latitude, and longitude coordinates). Therefore, the Pandas array is insufficient for manipulating climate datasets.

Xarray is the most suitable of the three for organizing climate datasets. Once we define the Xarray dataset appropriately, there are many flexible approaches to manipulate it. The Xarray dataset supports many methods familiar from NumPy and Pandas in addition to its own.

Therefore, always check which package a data array belongs to: NumPy, Pandas, or Xarray.

.. _BasProp:

Data Properties
---------------

This section is a summary of common properties. Please see the details of data structure in :ref:`the next section`.

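
As a minimal sketch of the properties summarized in this section, the example below builds one small object in each package and inspects it. The variable names and values are illustrative assumptions and are not taken from the sample programs.

.. code-block:: python

   import numpy as np
   import pandas as pd
   import xarray as xr

   # NumPy: a plain 3 x 4 array with no coordinates attached
   dat = np.arange(12.0).reshape(3, 4)
   print(dat.ndim, dat.shape, dat.size)

   # Pandas: a 1-D Series labeled by a date index
   day = pd.date_range("2018-01-01", periods=3, freq="D")
   sd = pd.Series([20.0, 20.1, 20.2], index=day, name="tmax")
   print(sd.ndim, sd.shape, sd.size, sd.name)

   # Xarray: the same 3 x 4 values, now labeled by time and longitude
   da = xr.DataArray(
       dat,
       coords={"time": day, "lon": [0.0, 90.0, 180.0, 270.0]},
       dims=("time", "lon"),
       name="tmax",
       attrs={"units": "degC"},  # hypothetical unit, for illustration only
   )
   print(da.ndim, da.shape, da.size, list(da.coords), da.attrs)

The same ``ndim``, ``shape``, and ``size`` names work in all three packages, while only the Xarray object carries named coordinates for more than one dimension.
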

NumPy
^^^^^

Let's assume that dat is the NumPy array.

* dat.ndim : Number of dimensions of the NumPy array
* dat.shape : Shape of the NumPy array
* dat.size : Total number of elements in the NumPy array

Pandas
^^^^^^

Let's assume that sd is the *Series* in Pandas.

* sd.ndim : Number of dimensions of the *Series*
* sd.shape : Shape of the *Series*
* sd.size : Total number of *Series* elements
* sd.values : Values of the *Series*
* sd.index : Index of the *Series*
* sd.axes : Axes of the *Series*
* sd.name : Name of the *Series*

Let's assume that df is the DataFrame in Pandas.

* df.ndim : Number of dimensions of the DataFrame
* df.shape : Shape of the DataFrame
* df.size : Total number of DataFrame elements
* df.values : Values of the DataFrame
* df.index : Index of the DataFrame
* df.axes : Axes of the DataFrame
* df.attrs : Attributes of the DataFrame
* df.columns : Columns of the DataFrame

Xarray
^^^^^^

Let's assume that da is the DataArray.

* da.ndim : Number of dimensions of the DataArray
* da.shape : Shape of the DataArray
* da.size : Total number of DataArray elements
* da.values : Values of the DataArray
* da.coords : Coordinates of the DataArray
* da.attrs : Attributes of the DataArray
* da.data : Similar to da.values

Let's assume that ds is the DataSet.

* ds.dims : Dimensions of the DataSet
* ds.sizes : Mapping from dimension names to lengths
* ds.coords : Coordinates of the DataSet
* ds.attrs : Attributes of the DataSet
* ds.data_vars : Dictionary of data variables

A DataSet has no single ``ndim``, ``shape``, ``size``, or ``values``, because each data variable can have its own dimensions. Query these on an individual variable instead (e.g., ``ds.temp.shape``).

.. _BasStr:

Data Structures
---------------

References:

* `NumPy structure `_
* `Pandas structure `_
* `Xarray structure `_

Pandas
^^^^^^

Pandas has two types of data structure: *Series* and *DataFrame*.

The **Series** is a one-dimensional array referred to by its index. One example is the time series of maximum daily temperature labeled by date. So, it is a 1-D array whose length is the number of dates.

**DataFrame** is a group of *Series*.
Each *Series* in DataFrame should have the same index (i.e., the same date coordinate). For example, a *DataFrame* contains time series of daily maximum temperature and daily precipitation labeled by date. So, it is a 2-D table with one row per date and N columns, where N is the number of variables.

.. warning::
   *Index* is a confusing concept in Pandas and Xarray. It has several meanings. Sometimes, *index* sounds like a label, a coordinate, or a reference. In Pandas, rows and columns both have indexes. Let's assume that you have a table like a CSV file. Generally speaking, the leftmost column would be the *index* and the top row would be the header.

Sample program: :download:`pandas_str-v1.py `.

Series
******

Let's make the data array *pandas.Series*. Here, we assume daily maximum temperature (*tmax*) and precipitation (*prcp*) from January 1, 2018, to January 5, 2018. The index of the Series corresponds to the dates in that period. A sample program looks like this.

.. literalinclude:: SRC/pandas_str-v1.py
   :language: python
   :lines: 1,2,4,5,8-11
   :linenos:

This sample script makes the index *day* using the `pandas.date_range `_ function. It covers January 1, 2018 to January 5, 2018 at a 1-day interval. The output looks like this::

   print(day)
   DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                  '2018-01-05'],
                 dtype='datetime64[ns]', freq='D')

The sample script fills *tmax* and *prcp* with random values from NumPy arrays using the `numpy.random.randn `_ function. We can make the *Series* data array using the `pandas.Series `_ function. The output of *tmax* looks like::

   print(tmax)
   2018-01-01    1.183303
   2018-01-02   -1.972609
   2018-01-03    0.707680
   2018-01-04    0.427772
   2018-01-05   -0.903869
   Freq: D, dtype: float64

The *prcp* variable has the same dimension size as *tmax*. The output of *prcp* looks like::

   print(prcp)
   2018-01-01   -0.016126
   2018-01-02    2.151788
   2018-01-03    1.134309
   2018-01-04    0.015779
   2018-01-05   -0.127509
   Freq: D, dtype: float64

.. note::
   We generated random values using the NumPy function. So, your output of tmax and prcp will differ from what is shown here.

DataFrame
*********

Next, let's create the *DataFrame*. The *DataFrame* contains several *Series* arrays. We can create a new *DataFrame* array (*df*) using the `pandas.DataFrame `_ function. Here, we show the example of two variables (i.e., "tmax" and "prcp"), but you can add more variables as you like.

.. literalinclude:: SRC/pandas_str-v1.py
   :language: python
   :lines: 15
   :linenos:
   :lineno-start: 9

Here is the output of *df*::

   print(df)
                   tmax      prcp
   2018-01-01 -0.674303 -0.558260
   2018-01-02 -0.157591 -0.511053
   2018-01-03  1.403665  0.714594
   2018-01-04 -0.405556  1.091943
   2018-01-05  0.279516 -0.269472

.. _xrStr:

Xarray
^^^^^^

As we have seen above, Pandas data arrays have two types. The first type is a single-variable data array with a label, called *Series*. The second type holds multiple variables (a set of *Series*) with the same label, called *DataFrame*.

Similarly, Xarray data arrays have two types: *DataArray* and *DataSet*. A *DataArray* consists of data values and coordinates. We can still use an *index* as in *Pandas.Series*. However, a *DataArray* contains many coordinates, and it is easier to organize the data by coordinates than by the *index*. Since climate data is a multi-dimensional array, we need to be familiar with *DataArray*. To organize multiple *DataArray* objects, we can use the *DataSet*, mainly for reading and writing DataArrays.

Sample program: :download:`xarray_str-v1.py `.

DataArray
*********

Let's check the script below.

.. literalinclude:: SRC/xarray_str-v1.py
   :language: python
   :lines: 1-23
   :linenos:

First, we define the values for coordinates. Here, let's consider a 3-dimensional variable consisting of time, latitude, and longitude coordinates. The time coordinate comes from the `pandas.period_range `_ function, labeled in YYYY-MM format (i.e., 2000-01). We named it "date" (Line 8).
We also define latitude and longitude as "lat" and "lon" using the `numpy.arange `_ function (Lines 9-10). So, the lat-lon grid here corresponds to a 1x1 degree resolution covering from -89 to 89 in latitude and from 0 to 359 in longitude.

Next, we make fake values using NumPy functions (Lines 13-14). This part is flexible, and we can skip it if we get the actual values from a data file.

Finally, we define the *DataArray* using the `xarray.DataArray `_ function as described on Lines 16 and 19. The function defines coordinates and dimensions in the "coords" and "dims" options. We can also add "name" and "attrs" to describe more information on Lines 17-18 and 21-22.

Here is the output of DataArray "tmax"::

   print(tmax)
   array([[[20.17526909, 20.56449617, 20.39136298, ..., 20.44666016,
            20.63424445, 20.85352716],
           ...,
           [20.24997025, 20.60670297, 20.13117975, ..., 20.14150434,
            20.84208498, 20.98607752]]])
   Coordinates:
     * time     (time) object 2000-01 2000-02 2000-03 ... 2005-10 2005-11 2005-12
     * lat      (lat) float64 -89.0 -88.0 -87.0 -86.0 -85.0 ... 86.0 87.0 88.0 89.0
     * lon      (lon) float64 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
   Attributes:
       units:    degC

DataSet
*******

When DataArrays share the same coordinates, we can merge them using a *DataSet*. This is an example of how to define a *DataSet*.

.. literalinclude:: SRC/xarray_str-v1.py
   :language: python
   :lines: 25-29
   :linenos:
   :lineno-start: 25

Here is the output of DataSet "df"::

   print(df)
   Dimensions:  (time: 72, lat: 179, lon: 360)
   Coordinates:
     * time     (time) object 2000-01 2000-02 2000-03 ... 2005-10 2005-11 2005-12
     * lat      (lat) float64 -89.0 -88.0 -87.0 -86.0 -85.0 ... 86.0 87.0 88.0 89.0
     * lon      (lon) float64 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
   Data variables:
       temp     (time, lat, lon) float64 20.88 20.94 20.05 ... 20.33 20.86 20.8
       prcp     (time, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0

.. _BasIndex:

Data Indexing
-------------

References:

* `Numpy indexing `_
* `Pandas indexing and selecting `_
* `Xarray indexing and selecting `_

Numpy
^^^^^

Sample script :download:`numpy_index-v1.py `

NumPy arrays use integer-based positions. Let's first define two NumPy arrays, x1 and x2, and see how the integer-based position works::

   --- x1 = np.arange(0, 20, 1) ---
   [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
   --- x2 = np.arange(0, 40, 2) ---
   [ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38]

The integer-based position runs from 0 to N-1, where N is the length of the array. The example below shows how to get the value at a specific position. A negative value indicates the position counted from the end, so you can select the last position using -1. ::

   --- x1[0] ---
   0
   --- x1[-1] ---
   19
   --- x1[3] ---
   3
   --- x1[-3] ---
   17

You can also select a range like this: **[start:end:interval]**. ::

   --- x1[5:] ---
   [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
   --- x1[:5] ---
   [0 1 2 3 4]
   --- x1[::2] ---
   [ 0  2  4  6  8 10 12 14 16 18]
   --- x1[::-2] ---
   [19 17 15 13 11  9  7  5  3  1]
   --- x2[2:8:2] ---
   [ 4  8 12]

You can also replace values by selecting the integer-based position. Be careful with the end of the range: the end position itself is excluded, so ``x2[3:7]`` covers positions 3 to 6::

   --- x1[3] = 0 ---
   before [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
   after  [ 0  1  2  0  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
   --- x2[3:7] = 0 ---
   before [ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38]
   after  [ 0  2  4  0  0  0  0 14 16 18 20 22 24 26 28 30 32 34 36 38]

Pandas
^^^^^^

Pandas can use integer-based positions like a NumPy array. Pandas also has label-based positions. As you can see below, label-based positioning is useful but can be confusing because of the indexing operators. Python and NumPy have the indexing operator ``[]`` and the attribute operator ``.``, which provide quick and easy access to data structures. Together, these give several methods for selecting elements, as described below.

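
To make the distinction between integer-based and label-based positions concrete, here is a minimal sketch. The Series values are made up for illustration and are not read from the sample script.

.. code-block:: python

   import pandas as pd

   # Five daily values labeled by date (illustrative numbers)
   day = pd.date_range("2018-01-01", periods=5, freq="D")
   tmax = pd.Series([20.0, 20.1, 20.2, 20.3, 20.4], index=day, name="tmax")

   by_position = tmax.iloc[2]         # third element, counted from 0
   by_label = tmax.loc["2018-01-03"]  # element labeled 2018-01-03
   print(by_position, by_label)

   # Slices behave differently: iloc excludes the end position,
   # whereas loc includes the end label.
   first_two = tmax.iloc[0:2]
   first_three = tmax.loc["2018-01-01":"2018-01-03"]
   print(len(first_two), len(first_three))

Both selections return the same element here, but the two slices differ in length because only the label-based slice includes its end point.
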

Sample script :download:`pandas_index-v1.py `

Series
******

Pandas uses ``iloc`` for integer-based positions and ``loc`` for label-based positions. There are comparable operators for single values: ``iat`` and ``at``. Let's define the Pandas.Series "tmax" and see examples using these operators below::

   --- tmax: Series ---
   2018-01-01    20.0
   2018-01-02    20.1
   2018-01-03    20.2
   2018-01-04    20.3
   2018-01-05    20.4
   Freq: D, dtype: float64

Integer-based positioning::

   tmax[2] = 20.2
   tmax.iloc[2] = 20.2
   tmax.iat[2] = 20.2

In recent pandas versions, plain ``tmax[2]`` on a non-integer index is deprecated, so ``tmax.iloc[2]`` is the safer choice.

Label-based positioning::

   tmax['2018-01-03'] = 20.2
   tmax.loc['2018-01-03'] = 20.2
   tmax.at['2018-01-03'] = 20.2

DataFrame
*********

Let's define the Pandas.DataFrame array "df" with index=day like this::

   --- df: DataFrame ---
               tmax  prcp
   2018-01-01  20.0     0
   2018-01-02  20.1     1
   2018-01-03  20.2     2
   2018-01-04  20.3     3
   2018-01-05  20.4     4

We can use the ``[]`` or ``.`` operators to derive the Series "tmax" from the DataFrame "df". Referring to ``df['tmax']`` and ``df.tmax`` provides the same result. ::

   -- df['tmax'] or df.tmax --
   2018-01-01    20.0
   2018-01-02    20.1
   2018-01-03    20.2
   2018-01-04    20.3
   2018-01-05    20.4
   Freq: D, Name: tmax, dtype: float64

When we select the *Series* "tmax" from the *DataFrame* "df", we can use both integer-based and label-based positions, as we have seen in the *Series* above. ::

   df.tmax[2] = 20.2
   df.tmax.iloc[2] = 20.2
   df.tmax['2018-01-03'] = 20.2
   df.tmax.loc['2018-01-03'] = 20.2

In contrast to the *Series*, selecting rows from the *DataFrame* requires the ``iloc`` or ``loc`` pointing methods. Without a pointing method, your script will raise an error. Here, the *DataFrame* df contains the index, tmax, and prcp in each column, as shown below.

.. code-block:: bash
   :linenos:
   :emphasize-lines: 3
   :lineno-start: 0

   2018-01-01  20.0     0
   2018-01-02  20.1     1
   2018-01-03  20.2     2
   2018-01-04  20.3     3
   2018-01-05  20.4     4

To extract the third row from the *DataFrame* df, we need to use pointers like ``df.iloc[2]`` or ``df.loc['2018-01-03']``.
Output looks like this::

   -- df.iloc[2] --
   tmax    20.2
   prcp     2.0
   Name: 2018-01-03 00:00:00, dtype: float64
   -- df.loc['2018-01-03'] --
   tmax    20.2
   prcp     2.0
   Name: 2018-01-03 00:00:00, dtype: float64

.. warning::
   ``df[2]`` or ``df['2018-01-03']`` don't work because *df* is a DataFrame, where ``[]`` selects columns rather than rows. Those expressions are applicable to *Pandas.Series* but **not** to *Pandas.DataFrame*.

In summary, the pointers ``iloc`` and ``loc`` can extract a row from a *Series* or a *DataFrame*.

Xarray
^^^^^^

In Pandas, we use the pointers ``iloc`` and ``loc`` to extract data from the *Series* and *DataFrame*. Xarray also supports positional ``[]`` and label-based ``.loc[]`` access on a *DataArray* whose dimensions have an *index*. However, Xarray prefers to use *coordinates* instead of the *index* to manipulate multi-dimensional data arrays. As a result, we use the pointers ``isel`` and ``sel`` to identify the coordinate-based position in a *DataArray* or *DataSet*.

DataArray
*********

Let's assume that "tmax" is the *DataArray* as defined in :ref:`here `.

To extract data at an integer-based position, we apply ``isel`` to the *DataArray*. Since a *DataArray* has several coordinates, we need to name the coordinate explicitly. When we want to get the first position in the time coordinate, we specify it like this: ``.isel(time=0)``. Here is the output of ``tmax.isel(time=0)``::

   print(tmax.isel(time=0))
   array([[20.8884473 , 20.26124451, 20.87749136, ..., 20.03960973,
           20.34806012, 20.68857054],
          ...,
          [20.33801077, 20.82038127, 20.25744721, ..., 20.40038417,
           20.61148223, 20.30406515]])
   Coordinates:
       time     object 2000-01
     * lat      (lat) float64 -89.0 -88.0 -87.0 -86.0 -85.0 ... 86.0 87.0 88.0 89.0
     * lon      (lon) float64 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
   Attributes:
       units:    degC

Similarly, we can get the data at a label-based position using ``sel`` and naming the coordinate.
Here is an example::

   print(tmax.sel(time='2004-05'))
   array([[20.68326577, 20.43074489, 20.03587269, ..., 20.37454377,
           20.47737278, 20.05191877],
          ...,
          [20.83891633, 20.96515379, 20.35485984, ..., 20.7117678 ,
           20.60424443, 20.71035624]])
   Coordinates:
       time     object 2004-05
     * lat      (lat) float64 -89.0 -88.0 -87.0 -86.0 -85.0 ... 86.0 87.0 88.0 89.0
     * lon      (lon) float64 0.0 1.0 2.0 3.0 4.0 ... 355.0 356.0 357.0 358.0 359.0
   Attributes:
       units:    degC

.. warning::
   Be careful when selecting latitude. You will get totally different results from ``.isel(lat=0)`` and ``.sel(lat=0)``. The former picks the first element (typically 90S or 90N), whereas the latter chooses the equator (0 degrees).

DataSet
*******

Similar to *DataArray*, we can use the pointers ``.isel`` and ``.sel`` in a *DataSet*. Here are some examples.

* ds.temp.sel(time='2003-05')
* ds.temp.isel(lon=10)
* ds.sel(lat=-30)

.. note::
   ``.loc`` is an indexer that takes square brackets, not a function call, so ``ds.temp.loc(lat=0)`` raises an error. Use ``ds.temp.loc[dict(lat=0)]`` or, more simply, ``ds.temp.sel(lat=0)``.

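
The difference between ``isel`` and ``sel`` can be checked with a minimal sketch. The grid mirrors the one described in this document, but the variable names, the use of ``pandas.date_range`` for the time axis, and the random values are assumptions made for illustration. The last selection uses ``.loc`` with square brackets, which is the form Xarray accepts.

.. code-block:: python

   import numpy as np
   import pandas as pd
   import xarray as xr

   # Coordinates: 3 monthly steps on a 1x1 degree grid (illustrative)
   time = pd.date_range("2000-01-01", periods=3, freq="MS")
   lat = np.arange(-89.0, 90.0, 1.0)  # -89, -88, ..., 89
   lon = np.arange(0.0, 360.0, 1.0)   # 0, 1, ..., 359

   temp = xr.DataArray(
       np.random.rand(time.size, lat.size, lon.size) + 20.0,  # fake values
       coords={"time": time, "lat": lat, "lon": lon},
       dims=("time", "lat", "lon"),
       name="temp",
   )
   ds = xr.Dataset({"temp": temp})

   first_lat = ds.temp.isel(lat=0)          # integer position 0 -> lat = -89
   equator = ds.temp.sel(lat=0)             # label 0 -> lat = 0 (the equator)
   also_equator = ds.temp.loc[dict(lat=0)]  # label-based, with square brackets

   print(float(first_lat.lat), float(equator.lat))
   print(equator.shape)

Here ``isel(lat=0)`` returns the southernmost row, whereas ``sel(lat=0)`` and ``.loc[dict(lat=0)]`` both return the equator.
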