Merge and Remove duplicated data

This page shows how to check the timestamp of a new dataset, insert this data in the correct time series location in master file, and remove duplicated data.

Function of data insertion

        # ----- Read TWO (new entry and master) CSV files -----
        df1 = pd.read_csv(indir1+filename1 \
          , delimiter = ',', na_values = ['NAN','"NAN"'], skiprows=[0,2,3], header=0, low_memory=False, index_col=0)
        dfm = pd.read_csv(indirm+filenamem \
          , delimiter = ',', na_values = ['NAN','"NAN"'], skiprows=[0,2,3], header=0, low_memory=False, index_col=0)
        #--- add additional information of headers ----

        if stid1 == stidm:

              time1 =pd.Series(pd.to_datetime(df1.index)) 
              timem =pd.Series(pd.to_datetime(dfm.index)) 

              newdf = insert_dat(dfm,df1,timem,time1,func)
              
              timestamp = pd.Series(newdf.index)
              newdf.index = timestamp
              newdf['RECORD'] = range(0,newdf.shape[0])

‘df1’ is the data of the new dataset, and ‘dfm’ is the data of the master file at each station.

Data insertion is performed when the station IDs (stid1, stidm) of the two files are the same.

Subroutine of data insertion

This subroutine defiles the timestamps of the new dataset and master file and determines wherein the master file the new data will be inserted.

def insert_dat(dfm,df1,timem,time1,func):
    if time1.iloc[-1] <= timem.iloc[0]:
       print("New entry is old than the masters: ",timem[0],'-',timem.iloc[-1],' ',time1.iloc[0],', ',time1.iloc[-1])
       newdf = pd.concat([df1,dfm],axis=0)
    else:
       if time1.iloc[0] > timem.iloc[-1]:
          print("new entry is new than the masters: ",timem[0],'-',timem.iloc[-1],' ',time1.iloc[0],', ',time1.iloc[-1])
          newdf = pd.concat([dfm,df1],axis=0)
       else:
          print("some new entry overlap with the masters:",timem[0],'-',timem.iloc[-1],' ',time1.iloc[0],', ',time1.iloc[-1])

          df1['label'] = 0
          dfm['label'] = 1
          newdf0 = pd.concat([df1,dfm], ignore_index=False)
          newdf0['TIME2']=newdf0.index

          # sort data in chronological data
          newdf0.sort_values(by = ['TIME2','label'],inplace=True)

          # drop the data to be duplicated, leave the 'func'th data of the duplicated data
          newdf = newdf0.drop_duplicates(subset=['TIME2'],keep=func)

          newdf = newdf.drop(['TIME2'],axis=1)
          newdf = newdf.drop(['label'],axis=1)

    return newdf

Sample program for inserting (time series) dataset into a master file

This program gets a filename under multiple directories (station names) and reads the data in one file in each directory.

Source file: merge_2files.v02.py