Faster way to update a pandas DataFrame

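I have a DataFrame named df with columns such as GENDER, AGE, and ID, among others, and another DataFrame named df_2 that has only three columns: GENDER, AGE, and ID. I want to update the GENDER and AGE values in df with the values from df_2.

So my idea is: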

import tqdm

df_id = df.ID.tolist()
df_2_id = df_2.ID.tolist()
df = df.set_index('ID')
df_2 = df_2.set_index('ID')
# all the ids in df_2_id are in df_id
for id in tqdm.tqdm_notebook(df_2_id):
    df.loc[id, 'GENDER'] = df_2.loc[id, 'GENDER']
    df.loc[id, 'AGE'] = df_2.loc[id, 'AGE']

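However, the for loop only runs at about 17.2 iterations per second, and it takes 2 hours to update the data. How can I make it faster?

I think you first need the intersection of the two indexes, and then you can set the values: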

idx = df.index.intersection(df_2.index)
df.loc[idx, 'GENDER'] = df_2['GENDER']
df.loc[idx, 'AGE'] = df_2['AGE']
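
The two assignments above can probably be combined into one. The following is a minimal sketch, assuming both frames are already indexed by ID; the sample values and the `cols` name are only illustrative:

```python
import pandas as pd

# Illustrative sample data, indexed by ID as in the question.
df = pd.DataFrame({'ID': list('abcdef'),
                   'AGE': [5, 3, 6, 9, 2, 4],
                   'GENDER': list('aaabbb')}).set_index('ID')
df_2 = pd.DataFrame({'ID': list('def'),
                     'AGE': [90, 20, 40],
                     'GENDER': list('eee')}).set_index('ID')

cols = ['GENDER', 'AGE']                 # columns to overwrite (illustrative name)
idx = df.index.intersection(df_2.index)  # IDs present in both frames

# One vectorized assignment; pandas aligns the right-hand side
# on index and column labels before writing.
df.loc[idx, cols] = df_2.loc[idx, cols]
```

This replaces the per-row Python loop with a single label-aligned assignment, which is where most of the speedup should come from.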

Or concat the two frames together and remove duplicates, keeping the last values:

df = pd.concat([df, df_2])
df = df[~df.index.duplicated(keep='last')]

A similar solution:

df = pd.concat([df, df_2]).reset_index().drop_duplicates('ID', keep='last')
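
As a rough illustration of the concat approach, here is a small sketch with made-up sample data. The name `combined` and the final `sort_index` call are assumptions added for the example, not part of the answer above:

```python
import pandas as pd

# Illustrative sample data, indexed by ID.
df = pd.DataFrame({'ID': list('abcdef'),
                   'AGE': [5, 3, 6, 9, 2, 4],
                   'GENDER': list('aaabbb')}).set_index('ID')
df_2 = pd.DataFrame({'ID': list('def'),
                     'AGE': [90, 20, 40],
                     'GENDER': list('eee')}).set_index('ID')

# Stack both frames; IDs d, e, f now appear twice.
combined = pd.concat([df, df_2])
# Keep only the last occurrence of each ID, i.e. the row that came from df_2.
combined = combined[~combined.index.duplicated(keep='last')]
# Updated rows sit where df_2 was appended, so sort if ID order matters.
combined = combined.sort_index()
```

One thing to watch: this swaps in whole rows from df_2, so if df has columns that df_2 lacks, those columns come out as NaN for the updated IDs. The index-based assignment avoids that.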

Sample:

import pandas as pd

df = pd.DataFrame({'ID':list('abcdef'),
                   'AGE':[5,3,6,9,2,4],
                   'GENDER':list('aaabbb')})
#print (df)

df_2 = pd.DataFrame({'ID':list('def'),
                   'AGE':[90,20,40],
                   'GENDER':list('eee')})
#print (df_2)
df = df.set_index('ID')
df_2 = df_2.set_index('ID')
idx = df.index.intersection(df_2.index)
df.loc[idx, 'GENDER'] = df_2['GENDER']
df.loc[idx, 'AGE'] = df_2['AGE']
print (df)
    AGE GENDER
ID            
a     5      a
b     3      a
c     6      a
d    90      e
e    20      e
f    40      e
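
Another option that may be worth trying is `DataFrame.update`, which is not part of the answer above. A minimal sketch with the same kind of sample data, assuming both frames are indexed by ID:

```python
import pandas as pd

# Illustrative sample data, indexed by ID.
df = pd.DataFrame({'ID': list('abcdef'),
                   'AGE': [5, 3, 6, 9, 2, 4],
                   'GENDER': list('aaabbb')}).set_index('ID')
df_2 = pd.DataFrame({'ID': list('def'),
                     'AGE': [90, 20, 40],
                     'GENDER': list('eee')}).set_index('ID')

# update() modifies df in place: it aligns df_2 on index and column labels
# and overwrites df with the non-NA values it finds there.
df.update(df_2)
```

Since `update` skips NA values in df_2 and works in place, it behaves slightly differently from the `loc` assignment; checking the dtypes afterwards is a reasonable precaution.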
