Reading time-series station data
This page provides how to read hourly time-series station data and create timestamps from the days of the year using Python.
The sample Python program for reading the hourly data is as follows.
Setup the modules
Importing the Python module to access the script from another Python file or module.
import csv
import glob
import pandas as pd
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import calendar
import shutil
import os
import sys
from datetime import datetime, date
Read the CSV data
The first step is to read the CSV data.
#print(' ----- Read CSV file ------')
df = pd.read_csv(findir+filename+'.csv' \
, delimiter = ',', skiprows=[0,2,3], na_values = ['NAN','"NAN"'],header=0)
‘findir’ is the directory name where the data are stored, and ‘filename’ is the file name.
The data contains four header rows. Therefore, skip three rows and use one header row as headers (column names).
The original dataset contains several types of missing values, such as NAN, “NAN”, and “””NAN”””. Add two kinds of NAN and “NAN” to the missing values information in the read module. (Python seems to recognize “NAN” and “””NAN””” as the same format).
Generate timestamps (YY-MM-DD HH:MM:SS)
Next, generating timestamps using the days of the year and hour information.
# --- create timestamp
# convert 'the days of the year' and time to the date/time (YYYY-MM-DD HH:MM:SS)
timestamp = dayofY_toTimeStamp(df)
def dayofY_toTimeStamp(df):
newdate = pd.to_datetime(df.loc[:,'Year'].values*10000000+df.loc[:,'Day_of_Year'].values*10000+df.loc[:,'HrMin'], format='%Y%j%H%M')
return newdate
dayofY_toTimeStamp is the subroutine to pick up the date and time (YY-MM-DD HH:MM:SS).
Create new data frame
Create a new empty data frame with the timestamp index, add some data columns from the original data, and concatenate other datasets along wth the columns.
# --- create a new dataframe
df2 = pd.DataFrame(index=timestamp)
df2.index.names = [tlabel]
df2['RECORD']=df.loc[:,'RECORD'].values
df2['SiteNum']=stid.values
# Extract data from header name "Year" to "XMTPWR"
tmp = df.loc[:,'Year':'XMTPWR']
tmp[tlabel] = timestamp
tmp.set_index(tlabel, inplace=True)
# Concatenating two datasets along with the column (axis=1)
df2 = pd.concat([df2,tmp],axis=1)
Save the data
Save the data to a CSV file with adding required header information.
#----
# save the data
#----
print('save')
print(newdf)
save_csv_file(newdf,outdir,iyr,locname,head)
def save_csv_file(dat,outdir,iyr,filename,head):
tmpdir = 'tmp/'
filelist = glob.glob(os.path.join(tmpdir, "*"))
for f in filelist:
os.remove(f)
dat.to_csv(tmpdir+filename+'.csv',index=True, na_rep='NAN')
# add header information to the datasets
fla = tmpdir+os.path.splitext(filename)[0]
addheaders(fla,fla+'.2','.csv',head[0],0)
addheaders(fla+'.2',fla+'.3','.csv',head[2],2)
# save the dataset in the 'outdir' directory
newfile = outdir+iyr+'/'+filename
filen = newfile + '.csv'
os.system('rm '+filen)
addheaders(fla+'.3',newfile,'.csv',head[3],3)
print(newfile+'.csv')
def addheaders(filename1,filename2,ext,head,inum):
with open(filename1+ext, "r") as infile:
reader = list(csv.reader(infile))
reader.insert(inum, head)
with open(filename2+ext, "w") as outfile:
writer = csv.writer(outfile)
for line in reader:
writer.writerow(line)
‘newfile’ is the final data.
Organize file and directory structure
This program raads all files under multiple directories, merges multiple files into one file per station each year, and saves the files.
Initial directory structure:
Output directory structure:
Source file: read_merge_files.py