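I have two DataFrames with user details and scores. Some users have a second score, and those users appear in the second DataFrame. I want to join or merge the two (I don't mind which) to get each user's final score: if the user exists in the second DataFrame, take the score from there, otherwise take it from the first.
DataFrame one: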
   first_name  last_name              email  score               feedback
0        Bill      First  user1@example.com   57.0     Needs to try again
1        Anna     Second  user2@example.com   10.0  Not enough to proceed
2        Fred      Third  user3@example.com  100.0             Excellent!
DataFrame two:
   first_name  last_name              email  score      feedback
0        Bill      First  user1@example.com  100.0  Much better!
Expected result:
   first_name  last_name              email  score               feedback
0        Bill      First  user1@example.com  100.0           Much better!
1        Anna     Second  user2@example.com   10.0  Not enough to proceed
2        Fred      Third  user3@example.com  100.0             Excellent!
The closest I've got is df1.merge(df2, on=['email', 'first_name', 'last_name'], how='left'), but that gives me the columns: first_name last_name email score_x feedback_x score_y feedback_y
Use pd.concat and drop_duplicates:
out = pd.concat([df2, df1]).drop_duplicates(['first_name', 'last_name', 'email'])
print(out)
# Output
   first_name  last_name              email  score               feedback
0        Bill      First  user1@example.com  100.0           Much better!
1        Anna     Second  user2@example.com   10.0  Not enough to proceed
2        Fred      Third  user3@example.com  100.0             Excellent!
This assumes that the first_name, last_name and email fields together form the primary key that identifies a record, as in your merge.
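For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the two frames from the question and applies the same concat and drop_duplicates step. The names cols and keys, and the trailing reset_index(drop=True), are my own additions for illustration and are not part of the original snippet.

```python
import pandas as pd

cols = ["first_name", "last_name", "email", "score", "feedback"]
df1 = pd.DataFrame(
    [
        ["Bill", "First", "user1@example.com", 57.0, "Needs to try again"],
        ["Anna", "Second", "user2@example.com", 10.0, "Not enough to proceed"],
        ["Fred", "Third", "user3@example.com", 100.0, "Excellent!"],
    ],
    columns=cols,
)
df2 = pd.DataFrame(
    [["Bill", "First", "user1@example.com", 100.0, "Much better!"]],
    columns=cols,
)

# Assumed primary key, as stated above.
keys = ["first_name", "last_name", "email"]

# df2 goes first: drop_duplicates keeps the first occurrence by default
# (keep="first"), so the second score wins whenever the key repeats.
out = (
    pd.concat([df2, df1])
    .drop_duplicates(subset=keys)
    .reset_index(drop=True)  # optional: renumber rows 0..n-1
)
print(out)
```

The order of the list passed to pd.concat matters here, because it decides which of the duplicated rows counts as the first one.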
Just before drop_duplicates, your DataFrame looks like this:
>>> pd.concat([df2, df1])
   first_name  last_name              email  score               feedback
0        Bill      First  user1@example.com  100.0           Much better!
0        Bill      First  user1@example.com   57.0     Needs to try again  # dupe
1        Anna     Second  user2@example.com   10.0  Not enough to proceed
2        Fred      Third  user3@example.com  100.0             Excellent!
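If it helps to check which rows collide on the key before dropping anything, duplicated(keep=False) flags every member of a duplicate group. The sketch below is only an illustration on the sample data; stacked, collisions and alt are names I made up. It also shows that reversing the concat order works if you pass keep="last", at the cost of a different row order.

```python
import pandas as pd

cols = ["first_name", "last_name", "email", "score", "feedback"]
df1 = pd.DataFrame(
    [
        ["Bill", "First", "user1@example.com", 57.0, "Needs to try again"],
        ["Anna", "Second", "user2@example.com", 10.0, "Not enough to proceed"],
        ["Fred", "Third", "user3@example.com", 100.0, "Excellent!"],
    ],
    columns=cols,
)
df2 = pd.DataFrame(
    [["Bill", "First", "user1@example.com", 100.0, "Much better!"]],
    columns=cols,
)
keys = ["first_name", "last_name", "email"]

stacked = pd.concat([df2, df1], ignore_index=True)

# keep=False marks every row whose key occurs more than once,
# so both of Bill's rows are selected here.
collisions = stacked[stacked.duplicated(subset=keys, keep=False)]
print(collisions)

# Same values with df1 first, as long as the last occurrence is kept.
# The surviving df2 row then sits at the bottom instead of the top.
alt = pd.concat([df1, df2], ignore_index=True).drop_duplicates(
    subset=keys, keep="last"
)
print(alt)
```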
An alternative using combine_first:
cols = ['email', 'first_name', 'last_name']
out = df2.set_index(cols).combine_first(df1.set_index(cols)).reset_index()
Output:
               email  first_name  last_name  score               feedback
0  user1@example.com        Bill      First  100.0           Much better!
1  user2@example.com        Anna     Second   10.0  Not enough to proceed
2  user3@example.com        Fred      Third  100.0             Excellent!
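One difference worth knowing about: combine_first works value by value, not row by row. It only takes a value from df2 where that value is not null, and falls back to df1 otherwise. The sketch below illustrates this with a hypothetical variant of df2 whose feedback is missing (that missing value is my assumption, not part of the question's data). The result mixes the new score with the old feedback, whereas concat plus drop_duplicates would keep the df2 row as is.

```python
import pandas as pd

cols = ["first_name", "last_name", "email", "score", "feedback"]
df1 = pd.DataFrame(
    [
        ["Bill", "First", "user1@example.com", 57.0, "Needs to try again"],
        ["Anna", "Second", "user2@example.com", 10.0, "Not enough to proceed"],
        ["Fred", "Third", "user3@example.com", 100.0, "Excellent!"],
    ],
    columns=cols,
)
# Hypothetical: the second attempt has a score but no feedback yet.
df2 = pd.DataFrame(
    [["Bill", "First", "user1@example.com", 100.0, None]],
    columns=cols,
)

keys = ["email", "first_name", "last_name"]
out = df2.set_index(keys).combine_first(df1.set_index(keys)).reset_index()

# Bill's score comes from df2, but the missing feedback is filled from df1.
bill = out[out["email"] == "user1@example.com"].iloc[0]
print(bill["score"], bill["feedback"])
```

If a missing value in the second DataFrame should still override the first, the concat and drop_duplicates approach is probably the safer choice.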