我想用2个panda数据帧进行条件追加,类似于合并到SQL函数。对于源数据帧中的每一行,如果索引不存在,请将其插入到目标数据帧中。如果索引确实存在,请检查辅助条件。如果满足条件,请更新现有行。
这里有一个例子:
import pandas as pd
df1 = pd.DataFrame([{'index':'1st','checkval':2,'storeval':'elephant'},
{'index':'2nd','checkval':7,'storeval':'giraffe'}]).set_index('index')
df2 = pd.DataFrame([{'index':'1st','checkval':3,'storeval':'hippopotamus'},
{'index':'3rd','checkval':4,'storeval':'seagull'}]).set_index('index')
以下是df1看起来像
checkval storeval
index
1st 2 elephant
2nd 7 giraffe
以下是df2看起来像
checkval storeval
index
1st 3 hippopotamus
3rd 4 seagull
这里有一种暴力的方式来做我所描述的:
for ind2, row2 in df2.iterrows():
found = False
for ind1, row1 in df1.iterrows():
if ind2 == ind1:
#Index matched
found = True
if row2['checkval'] > row1['checkval']:
#Conditions met, updating existing row
df1.loc[ind1] = row2
if not found:
# Row not already in df, insert
df1 = df1.append(row2)
输出为:
checkval storeval
index
1st 3 hippopotamus
2nd 7 giraffe
3rd 4 seagull
然而,我很想找到一些像这样的内置功能
df1.merge(d2, how = 'left', conditions = lambda df1,df2: df2['checkval']>df1['checkval'])
或者类似的东西。有人对如何改进"蛮力"方法有什么建议吗?
尽量不要在panda中创建不必要的循环,这会减慢速度并扰乱代码
我认为我们可以使用DataFrame.append
和之前的groupby.last
排序:
new_df = df1.append(df2).sort_values('checkval').groupby(level=0).last()
#new_df = df1.append(df2).sort_values('checkval').groupby(level='index').last()
备选方案:
new_df = df1.append(df2)
new_df = new_df.loc[~new_df.sort_values('checkval')
.index
.duplicated(keep='last'),:].sort_index()
print(new_df)
new_df = (df1.append(df2)
.reset_index()
.sort_values('checkval')
.drop_duplicates(subset='index',keep='last')
.set_index('index')
.sort_index())
print(new_df)
输出
checkval storeval
index
1st 3 hippopotamus
2nd 7 giraffe
3rd 4 seagull