使用 python-pandas 实现 SQL "MERGE INTO" 命令的最佳方法?



我想用2个panda数据帧进行条件追加,类似于合并到SQL函数。对于源数据帧中的每一行,如果索引不存在,请将其插入到目标数据帧中。如果索引确实存在,请检查辅助条件。如果满足条件,请更新现有行。

这里有一个例子:

import pandas as pd
df1 = pd.DataFrame([{'index':'1st','checkval':2,'storeval':'elephant'},
{'index':'2nd','checkval':7,'storeval':'giraffe'}]).set_index('index')
df2 = pd.DataFrame([{'index':'1st','checkval':3,'storeval':'hippopotamus'},
{'index':'3rd','checkval':4,'storeval':'seagull'}]).set_index('index')

以下是df1看起来像

checkval    storeval
index       
1st     2           elephant
2nd     7           giraffe

以下是df2看起来像

checkval   storeval
index       
1st     3       hippopotamus
3rd     4       seagull

这里有一种暴力的方式来做我所描述的:

for ind2, row2 in df2.iterrows():
found = False
for ind1, row1 in df1.iterrows():
if ind2 == ind1:
#Index matched
found = True
if row2['checkval'] > row1['checkval']:
#Conditions met, updating existing row
df1.loc[ind1] = row2
if not found:
# Row not already in df, insert
df1 = df1.append(row2)

输出为:

checkval    storeval
index       
1st     3   hippopotamus
2nd     7   giraffe
3rd     4   seagull

然而,我很想找到一些像这样的内置功能

df1.merge(d2, how = 'left', conditions = lambda df1,df2: df2['checkval']>df1['checkval'])

或者类似的东西。有人对如何改进"蛮力"方法有什么建议吗?

尽量不要在panda中创建不必要的循环,这会减慢速度并扰乱代码

我认为我们可以使用DataFrame.append和之前的groupby.last排序:

new_df = df1.append(df2).sort_values('checkval').groupby(level=0).last()
#new_df = df1.append(df2).sort_values('checkval').groupby(level='index').last()

备选方案:

new_df = df1.append(df2)
new_df = new_df.loc[~new_df.sort_values('checkval')
.index
.duplicated(keep='last'),:].sort_index()
print(new_df)

new_df = (df1.append(df2)
.reset_index()
.sort_values('checkval')
.drop_duplicates(subset='index',keep='last')
.set_index('index')
.sort_index())
print(new_df)

输出

checkval      storeval
index                        
1st           3  hippopotamus
2nd           7       giraffe
3rd           4       seagull

相关内容

最新更新