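I have a pandas DataFrame in which I have already matched names to IDs, but some IDs have no name. For those, I want to go back to the mapping file and search the "alternative_ID_list" column to see whether any entry there matches the ID, and take the corresponding name.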
current df
    name     ID
0    joe  USER1
3   mary  USER2
5  USER3  USER3
8  USER4  USER4
9  USER5  USER5
9  USER6  USER6
bad_matches = [3, 4, 5, 6]
mapping_df =
    name       ID   alternative_ID_list
0    joe    USER1            USER213.32
3   mary    USER2            USER643.11
5    sam   USER98              USER31.5
7   jack  USER992               USER4.2
8   rick  USER902    USER5.6, USER321.1
9   john  USER979    USER6.8, USER987.9
10   jay  USER980  USER479.2, USER989.0
#use mapping_df to find the bad_match_IDs (take the first match found if multiple rows for one bad_match_id)
desired
    name     ID
0    joe  USER1
3   mary  USER2
5  USER3  USER3
7   jack  USER4
8   rick  USER5
9   john  USER6
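First split the alternative_ID_list column and expand it to one row per ID with DataFrame.explode, convert the values to integers and filter them by bad_matches. The possible matches can then be joined to the original DataFrame with a left-join DataFrame.merge. Finally, set the same index as the original and fill the unmatched rows from it with DataFrame.fillna. Note that this first version assumes the IDs are plain integers, as the printed output shows; the EDIT further down handles string IDs such as USER4: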
df1 = (mapping_df.assign(alternative_ID_list=mapping_df.alternative_ID_list.str.split(', '))
                 .explode('alternative_ID_list')
                 .astype({'alternative_ID_list':int})
                 .drop_duplicates('alternative_ID_list')
                 .loc[lambda x: x['alternative_ID_list'].isin(bad_matches)])
print (df1)
   name   ID  alternative_ID_list
7  jack  992                  379
8  rick  902                  579
9  john  979                  479
f = lambda x: x.strip('_')
df1 = df.merge(df1,
               left_on='ID',
               right_on='alternative_ID_list',
               how='left',
               suffixes=('','_'))[['name_','ID_']].rename(columns=f)
df = df1.set_index(df.index).fillna(df).astype({'ID':int})
print (df)
     name   ID
0     joe  123
3    mary  342
5  ID/214  214
8    jack  992
9    rick  902
9    john  979
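For anyone who wants to run the merge-based version end to end, here is a minimal sketch. The two frames and the bad_matches list are hypothetical numeric sample data made up for illustration, not the asker's real data, and the intermediate names alt, merged and fixed are my own. As in the code above, both name and ID are taken from the mapping row when a match is found.

```python
import pandas as pd

# Hypothetical numeric sample data, invented for this sketch.
df = pd.DataFrame(
    {"name": ["joe", "ID/379", "ID/579"], "ID": [123, 379, 579]},
    index=[0, 5, 8],
)
mapping_df = pd.DataFrame(
    {
        "name": ["joe", "jack", "rick"],
        "ID": [123, 992, 902],
        "alternative_ID_list": ["213", "379", "579, 321"],
    }
)
bad_matches = [379, 579]

# One row per alternative ID, as integers, restricted to the unmatched IDs.
alt = (
    mapping_df.assign(
        alternative_ID_list=mapping_df["alternative_ID_list"].str.split(", ")
    )
    .explode("alternative_ID_list")
    .astype({"alternative_ID_list": int})
    .drop_duplicates("alternative_ID_list")
    .loc[lambda x: x["alternative_ID_list"].isin(bad_matches)]
)

# Left join keeps every row of df; columns coming from alt get a "_" suffix.
merged = df.merge(
    alt,
    left_on="ID",
    right_on="alternative_ID_list",
    how="left",
    suffixes=("", "_"),
)

# Keep the mapped columns, restore the original index, and fall back to
# the original values wherever no mapping row was found.
fixed = (
    merged[["name_", "ID_"]]
    .rename(columns=lambda c: c.strip("_"))
    .set_index(df.index)
    .fillna(df)
    .astype({"ID": int})
)
print(fixed)
```

The left join leaves NaN in name_ and ID_ for rows without a match, which is why the fillna step against the original df is needed afterwards.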
EDIT: Because only the name column needs to be replaced, the solution can be simplified with Series.map:
# compare as strings, because the extracted digits are strings
mapping = [str(x) for x in bad_matches]
df1 = (mapping_df.assign(alternative_ID_list=mapping_df.alternative_ID_list.str.split(', '))
                 .explode('alternative_ID_list')
                 # keep only the part before the dot, e.g. USER4.2 -> USER4
                 .assign(alternative_ID_list = lambda x: x.alternative_ID_list.str.split('.').str[0])
                 .drop_duplicates('alternative_ID_list')
                 # keep rows whose trailing number is one of bad_matches
                 .loc[lambda x: x['alternative_ID_list'].str.extract(r'(\d+)$', expand=False).isin(mapping)]
      )
print (df1)
   name       ID  alternative_ID_list
7  jack  USER992                USER4
8  rick  USER902                USER5
9  john  USER979                USER6
df['name'] = df['ID'].map(df1.set_index('alternative_ID_list')['name']).fillna(df['name'])
print (df)
    name     ID
0    joe  USER1
3   mary  USER2
5  USER3  USER3
8   jack  USER4
9   rick  USER5
9   john  USER6
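If it helps, the Series.map variant can be wrapped in a small function and run against the sample data from the question. This is only a sketch: the function name fill_names_from_alternatives and the local names wanted, alt and lookup are hypothetical, and I use the index labels from the "desired" table so that every label is unique.

```python
import pandas as pd


def fill_names_from_alternatives(df, mapping_df, bad_matches):
    """Hypothetical helper: fill df['name'] via mapping_df's alternative IDs."""
    wanted = [str(x) for x in bad_matches]
    alt = (
        mapping_df.assign(
            alternative_ID_list=mapping_df["alternative_ID_list"].str.split(", ")
        )
        .explode("alternative_ID_list")
        # Keep only the part before the dot: "USER4.2" -> "USER4".
        .assign(
            alternative_ID_list=lambda x: x["alternative_ID_list"].str.split(".").str[0]
        )
        # Keep the first mapping row for each alternative ID.
        .drop_duplicates("alternative_ID_list")
        # Keep rows whose trailing digits are one of the unmatched numbers.
        .loc[
            lambda x: x["alternative_ID_list"]
            .str.extract(r"(\d+)$", expand=False)
            .isin(wanted)
        ]
    )
    lookup = alt.set_index("alternative_ID_list")["name"]
    out = df.copy()
    # IDs missing from lookup map to NaN, so the original name is kept.
    out["name"] = out["ID"].map(lookup).fillna(out["name"])
    return out


df = pd.DataFrame(
    {
        "name": ["joe", "mary", "USER3", "USER4", "USER5", "USER6"],
        "ID": ["USER1", "USER2", "USER3", "USER4", "USER5", "USER6"],
    },
    index=[0, 3, 5, 7, 8, 9],
)
mapping_df = pd.DataFrame(
    {
        "name": ["joe", "mary", "sam", "jack", "rick", "john", "jay"],
        "ID": ["USER1", "USER2", "USER98", "USER992", "USER902", "USER979", "USER980"],
        "alternative_ID_list": [
            "USER213.32", "USER643.11", "USER31.5", "USER4.2",
            "USER5.6, USER321.1", "USER6.8, USER987.9", "USER479.2, USER989.0",
        ],
    },
    index=[0, 3, 5, 7, 8, 9, 10],
)

fixed = fill_names_from_alternatives(df, mapping_df, bad_matches=[3, 4, 5, 6])
print(fixed)
```

USER3 keeps its placeholder name here because the only similar alternative ID is USER31, whose trailing number is 31 rather than 3. Anchoring the pattern with $ and comparing the whole extracted number is what prevents that partial match.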