My df1 has 200k records and df2 has 9k records. I want to check whether the email ids are the same, and if they are, compare the unique ids; if the unique ids differ, replace the df1 unique id with the df2 unique id.
I tried this code, but it takes far too long. How can I reduce its running time?
```python
for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        #print(row2['Email'])
        if row1['email'] == row2['Email']:
            #print(row1['email'], row2['Email'])
            if int(row1['unique_ID']) != int(row2['unique_ID']):
                print("yes")
                row1['unique_ID'] = row2['unique_ID']
                df1.loc[index1] = row1
```
`iterrows` is very slow, as you have seen. Use merge, groupby, and filtering to find the relevant rows and change all of their IDs at once. You can use groupby and then count how many unique IDs each email has. Here is a toy example:
import pandas as pd

row1list = ['stack', '10']
row2list = ['overflow', '20']
row3list = ['overflow', '30']
df1 = pd.DataFrame([row1list, row2list, row3list], columns=['email', 'unique_ID'])
row1list = ['stack', '10']
row2list = ['overflow', '40']
df2 = pd.DataFrame([row1list, row2list], columns=['email', 'unique_ID'])
df_conflicting_ids = df1.groupby('email', as_index=False).agg({
'unique_ID': lambda x: len(pd.Series.unique(x))})
df_conflicting_ids = df_conflicting_ids.rename(columns={'unique_ID':'unique_ID_count'})
df_conflicting_ids = df_conflicting_ids[df_conflicting_ids['unique_ID_count'] > 1]
print(df_conflicting_ids)
# email unique_ID_count
# 0 overflow 2
del df_conflicting_ids['unique_ID_count'] # don't need column anymore
df_conflicting_ids = df_conflicting_ids.merge(df2, on='email', how='left')
df_conflicting_ids = df_conflicting_ids.rename(columns={'unique_ID':'master_unique_ID'})
df1 = df1.merge(df_conflicting_ids, on='email', how='left')
df1.loc[df1['master_unique_ID'].notnull(), 'unique_ID'] = df1['master_unique_ID']
print(df1)
# email unique_ID master_unique_ID
# 0 stack 10 NaN
# 1 overflow 40 40
# 2 overflow 40 40
del df1['master_unique_ID'] # don't need column anymore
I am not sure whether you want to drop duplicates after overwriting unique_ID. Also, you may want to store unique_ID as integers, since you were testing them after converting to int.
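If you do want both of those cleanups, a minimal sketch (using made-up values shaped like the toy example above, after the overwrite step) could look like this:

```python
import pandas as pd

# Hypothetical post-overwrite state: unique_ID is still a string column,
# and the two 'overflow' rows have become exact duplicates.
df1 = pd.DataFrame({'email': ['stack', 'overflow', 'overflow'],
                    'unique_ID': ['10', '40', '40']})

# Cast unique_ID back to int so later comparisons don't need int() calls
df1['unique_ID'] = df1['unique_ID'].astype(int)

# Drop rows that are now exact duplicates and reindex
df1 = df1.drop_duplicates().reset_index(drop=True)
print(df1)
#       email  unique_ID
# 0     stack         10
# 1  overflow         40
```

Note that `astype(int)` will fail if the column contains NaN (e.g. emails with no match in df2), so cast only after filling or filtering those rows.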