Replace the values of one time series with the values of another time series in pandas



I have two data frames:

s1:
         time     X1
0  1234567000  96.32
1  1234567005  96.01
2  1234567009  96.05
s2:
         time     X2
0  1234566999  23.88
1  1234567006  23.96

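I want to replace the values of the first time series/data frame with those of the second while keeping the timestamps, to get:

frame:
         time     X2
0  1234567000  23.88
1  1234567005  23.88
2  1234567009  23.96

The output (frame) should have the timestamps of s1 but the values of s2. time is an integer (it is not a UNIX timestamp). X1 and X2 are floats.

Is there a neat way to do this with pandas?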


I currently use a chain of outer join/merge + fillna + inner join/merge + del columns, but this does not seem very efficient.

from __future__ import print_function
import pandas as pd


def merge_dataframes(s1, s2, common_column, back_fill=False, verbose=False):
    if verbose: print('s1: \n{0}'.format(s1))
    if verbose: print('s2: \n{0}'.format(s2))
    # Outer merge so that the timestamps of both frames are present, in time order.
    frame = pd.merge(s1, s2, how='outer').sort_values(by=common_column)
    if verbose: print('frame: \n{0}'.format(frame))
    # Carry the last known values forward (fillna(method='ffill') is deprecated).
    frame = frame.ffill()
    if verbose: print('frame: \n{0}'.format(frame))
    # Inner merge with s1 keeps only the rows that carry s1's timestamps.
    frame = pd.merge(frame, s1, how='inner').sort_values(by=common_column)
    if verbose: print('frame: \n{0}'.format(frame))
    # Drop the columns that exist only in s1.
    for column_name in s1.columns:
        if (column_name not in common_column) and (column_name not in s2.columns):
            del frame[column_name]
    if back_fill:
        frame = frame.bfill()
    if verbose: print('frame: \n{0}'.format(frame))
    return frame


def main():
    '''
    Demonstrate the use of merge_dataframes(s1, s2, common_column)
    '''
    s1 = pd.DataFrame({
        'time': [1234567000, 1234567005, 1234567009],
        'X1': [96.32, 96.01, 96.05]
    }, columns=['time', 'X1'])
    s2 = pd.DataFrame({
        'time': [1234566999, 1234567006],
        'X2': [23.88, 23.96]
    }, columns=['time', 'X2'])
    common_column = 'time'
    frame = merge_dataframes(s1, s2, common_column, verbose=True)
    print('frame: \n{0}'.format(frame))


if __name__ == "__main__":
    main()
    #cProfile.run('main()') # if you want to do some profiling

Here is my solution; I have broken it down into steps.

1st: search only in the past

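# df1 and df2 here are the question's s1 and s2.
# Matrix of time differences: rows are df1.time, columns are df2.time.
M1 = pd.DataFrame({}, index=df1.time, columns=df2.time)
M1 = M1.apply(lambda x: pd.Series(x.index - x.name, index=x.index))
M1.index.name = None  # avoid two levels both named 'time' after stacking
M2 = M1.stack().reset_index()
M2 = M2.loc[M2[0] >= 0].copy()  # keep only df2 rows at or before each df1 row
M2[0] = abs(M2[0])
M2 = M2.sort_values(['level_0', 0]).drop_duplicates(['level_0'], keep='first')
df1.merge(M2, left_on='time', right_on='level_0', how='left').merge(df2, left_on='time_y', right_on='time').loc[:, ['time_x', 'X1', 'X2']]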
       time_x     X1     X2
0  1234567000  96.32  23.88
1  1234567005  96.01  23.88
2  1234567009  96.05  23.96
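The difference matrix above can also be built with NumPy broadcasting instead of DataFrame.apply. The following is only a sketch of that idea, not the answer's own code: the names t1, t2, diff and frame_past are made up for illustration, and the frames are the question's s1 and s2 rebuilt so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# The question's sample frames.
s1 = pd.DataFrame({'time': [1234567000, 1234567005, 1234567009],
                   'X1': [96.32, 96.01, 96.05]})
s2 = pd.DataFrame({'time': [1234566999, 1234567006],
                   'X2': [23.88, 23.96]})

t1 = s1['time'].to_numpy()
t2 = s2['time'].to_numpy()

# diff[i, j] = t1[i] - t2[j]; entries >= 0 mark s2 rows at or before s1 row i.
diff = t1[:, None] - t2[None, :]

# Hide future rows behind a huge sentinel, then take the smallest gap per row.
past_only = np.where(diff >= 0, diff, np.iinfo(diff.dtype).max)
nearest_past = past_only.argmin(axis=1)
has_past = (diff >= 0).any(axis=1)

# Rows of s1 with no earlier s2 row get NaN instead of a wrong value.
frame_past = pd.DataFrame({
    'time': t1,
    'X2': np.where(has_past, s2['X2'].to_numpy()[nearest_past], np.nan),
})
print(frame_past)
```

Like the stacked matrix, this needs memory proportional to len(s1) * len(s2), so it is probably only reasonable for small frames.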

2nd: search in all (past and future). The only difference is that the filter on M2[0]>=0 is commented out.

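# df1 and df2 here are the question's s1 and s2.
M1 = pd.DataFrame({}, index=df1.time, columns=df2.time)
M1 = M1.apply(lambda x: pd.Series(x.index - x.name, index=x.index))
M1.index.name = None  # avoid two levels both named 'time' after stacking
M2 = M1.stack().reset_index()
# M2 = M2.loc[M2[0] >= 0]  # filter disabled: future df2 rows are allowed too
M2[0] = abs(M2[0])
M2 = M2.sort_values(['level_0', 0]).drop_duplicates(['level_0'], keep='first')
df1.merge(M2, left_on='time', right_on='level_0', how='left').merge(df2, left_on='time_y', right_on='time').loc[:, ['time_x', 'X1', 'X2']]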

Out[173]:
       time_x     X1     X2
0  1234567000  96.32  23.88
1  1234567005  96.01  23.96
2  1234567009  96.05  23.96

Update: using itertools.product

from itertools import product
import pandas as pd
# Every (df1.time, df2.time) pair; df1 and df2 are the question's s1 and s2.
DF = pd.DataFrame(list(product(df1.time, df2.time)), columns=['l1', 'l2'])
DF['DIFF'] = DF.l1 - DF.l2
DF = DF.loc[DF.DIFF >= 0]  # keep only df2 rows at or before each df1 row
DF = DF.sort_values(['l1', 'DIFF']).drop_duplicates(['l1'], keep='first')
df1.merge(DF, left_on='time', right_on='l1', how='left').merge(df2, left_on='l2', right_on='time').loc[:, ['time_x', 'X1', 'X2']]

Out[357]:
       time_x     X1     X2
0  1234567000  96.32  23.88
1  1234567005  96.01  23.88
2  1234567009  96.05  23.96
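The same all-pairs idea can be written without itertools by letting pandas build the product. This is a sketch that assumes pandas 1.2 or later (where merge accepts how='cross'); the names pairs and frame_cross and the '_s2' suffix are chosen here for illustration only.

```python
import pandas as pd

# The question's sample frames.
s1 = pd.DataFrame({'time': [1234567000, 1234567005, 1234567009],
                   'X1': [96.32, 96.01, 96.05]})
s2 = pd.DataFrame({'time': [1234566999, 1234567006],
                   'X2': [23.88, 23.96]})

# Cross join: every s1 timestamp paired with every s2 row.
pairs = s1[['time']].merge(s2, how='cross', suffixes=('', '_s2'))

# Keep only s2 rows at or before the s1 timestamp ...
pairs = pairs[pairs['time_s2'] <= pairs['time']]

# ... and, per s1 timestamp, keep the latest of those.
frame_cross = (pairs.sort_values(['time', 'time_s2'])
                    .drop_duplicates('time', keep='last')
                    .loc[:, ['time', 'X2']]
                    .reset_index(drop=True))
print(frame_cross)
```

An s1 timestamp with no earlier s2 row simply disappears from frame_cross; a left merge back onto s1 would be needed to keep it as NaN.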
pd.merge_asof

This works for your sample:

pd.merge_asof(s1, s2, on='time')
Out[108]:
         time     X1     X2
0  1234567000  96.32  23.88
1  1234567005  96.01  23.88
2  1234567009  96.05  23.96
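To end up with exactly the frame asked for in the question (s1's timestamps with only the X2 column), one option is to pass just the time column of s1. A minimal sketch, assuming both frames can be sorted by time, which merge_asof requires of its key:

```python
import pandas as pd

# The question's sample frames.
s1 = pd.DataFrame({'time': [1234567000, 1234567005, 1234567009],
                   'X1': [96.32, 96.01, 96.05]})
s2 = pd.DataFrame({'time': [1234566999, 1234567006],
                   'X2': [23.88, 23.96]})

# The default direction='backward' takes, for each s1 timestamp,
# the last s2 row whose time is less than or equal to it.
frame = pd.merge_asof(s1[['time']].sort_values('time'),
                      s2.sort_values('time'),
                      on='time')
print(frame)
```

Timestamps in s1 that come before the first s2 row get NaN in X2, since there is nothing earlier to carry forward.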

Edit: a solution for an absolute (nearest in either direction) merge

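import numpy as np

def Matcher2(value, mat):
    # Position of the s2 timestamp closest to value, in either direction.
    return np.argmin(np.absolute(mat - value))

mat = s2.time.to_numpy()  # as_matrix() was removed in pandas 1.0
s1['dex'] = s1.time.apply(lambda row: Matcher2(row, mat))
mg = pd.merge(s1, s2, left_on='dex', right_index=True, how='left')
print(mg[['time_x', 'X1', 'X2']])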
       time_x     X1     X2
0  1234567000  96.32  23.88
1  1234567005  96.01  23.96
2  1234567009  96.05  23.96
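For the nearest-in-either-direction case, merge_asof also has a direction parameter, so the helper column may not be needed. A sketch on the question's sample data (the variable name nearest is made up here):

```python
import pandas as pd

# The question's sample frames, already sorted by time.
s1 = pd.DataFrame({'time': [1234567000, 1234567005, 1234567009],
                   'X1': [96.32, 96.01, 96.05]})
s2 = pd.DataFrame({'time': [1234566999, 1234567006],
                   'X2': [23.88, 23.96]})

# direction='nearest' matches each s1 row with the closest s2 timestamp,
# whether it lies before or after.
nearest = pd.merge_asof(s1, s2, on='time', direction='nearest')
print(nearest)
```

Unlike the 'dex' approach, this leaves s1 unmodified and keeps a single time column in the result.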
