Mapping incorrect matches to another DataFrame



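I have a pandas DataFrame in which I have matched names to IDs, but some IDs have no name. For those, I want to go back to the mapping file and search the "alternative_ID_list" column to see whether any entry matches the ID, and if so take the corresponding name.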

current df
    name     ID
0   joe      USER1
3   mary     USER2
5   USER3    USER3
8   USER4    USER4
9   USER5    USER5
9   USER6    USER6

bad_matches=[3, 4, 5, 6]
mapping_df =
    name  ID       alternative_ID_list
0   joe   USER1    USER213.32
3   mary  USER2    USER643.11
5   sam   USER98   USER31.5
7   jack  USER992  USER4.2
8   rick  USER902  USER5.6, USER321.1
9   john  USER979  USER6.8, USER987.9
10  jay   USER980  USER479.2, USER989.0
#use mapping_df to find the bad_match_IDs (take the first match found if multiple rows for one bad_match_id) 
desired 
    name     ID
0   joe      USER1
3   mary     USER2
5   USER3    USER3
7   jack     USER4
8   rick     USER5
9   john     USER6

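First split the column alternative_ID_list and expand it with DataFrame.explode, convert the values to integers, and filter by bad_matches. That makes it possible to match against the original DataFrame with a left-join DataFrame.merge. Finally, set the same index as the original and fill the rows that did not match using DataFrame.fillna: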

# note: the integer cast assumes the alternative IDs are plain integers, as in the printed output below
df1 = (mapping_df.assign(alternative_ID_list=mapping_df.alternative_ID_list.str.split(', '))
                 .explode('alternative_ID_list')
                 .astype({'alternative_ID_list':int})
                 .drop_duplicates('alternative_ID_list')
                 .loc[lambda x: x['alternative_ID_list'].isin(bad_matches)])
print (df1)
name   ID  alternative_ID_list
7  jack  992                  379
8  rick  902                  579
9  john  979                  479
f = lambda x: x.strip('_')
df1 = df.merge(df1,
               left_on='ID',
               right_on='alternative_ID_list',
               how='left',
               suffixes=('','_'))[['name_','ID_']].rename(columns=f)

df = df1.set_index(df.index).fillna(df).astype({'ID':int})
print (df)
name   ID
0     joe  123
3    mary  342
5  ID/214  214
8    jack  992
9    rick  902
9    john  979
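
The printed output above uses plain integer IDs. As a hedged sketch, the same explode-then-merge idea applied to the USER-style sample data from the question might look like the following. The frame constructors and the column names alt_id and name_alt are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the question (the constructor calls are an assumption).
df = pd.DataFrame(
    {"name": ["joe", "mary", "USER3", "USER4", "USER5", "USER6"],
     "ID": ["USER1", "USER2", "USER3", "USER4", "USER5", "USER6"]},
    index=[0, 3, 5, 8, 9, 9],
)
mapping_df = pd.DataFrame(
    {"name": ["joe", "mary", "sam", "jack", "rick", "john", "jay"],
     "ID": ["USER1", "USER2", "USER98", "USER992", "USER902", "USER979", "USER980"],
     "alternative_ID_list": ["USER213.32", "USER643.11", "USER31.5", "USER4.2",
                             "USER5.6, USER321.1", "USER6.8, USER987.9",
                             "USER479.2, USER989.0"]},
    index=[0, 3, 5, 7, 8, 9, 10],
)
bad_matches = [3, 4, 5, 6]
wanted = [str(x) for x in bad_matches]

# One row per alternative ID, with the ".NN" suffix dropped (hypothetical column alt_id).
alt = (
    mapping_df.assign(alt_id=mapping_df["alternative_ID_list"].str.split(", "))
    .explode("alt_id")
    .reset_index(drop=True)
    .assign(alt_id=lambda x: x["alt_id"].str.split(".").str[0])
    .drop_duplicates("alt_id")  # keep the first match per alternative ID
    .loc[lambda x: x["alt_id"].str.extract(r"(\d+)$", expand=False).isin(wanted)]
    [["alt_id", "name"]]
)

# Left join keeps every row of df; unmatched rows get NaN in name_alt.
merged = df.merge(alt, left_on="ID", right_on="alt_id",
                  how="left", suffixes=("", "_alt"))

# Prefer the looked-up name, fall back to the existing one; assign positionally
# because df has a duplicated index label (9).
result = df.copy()
result["name"] = merged["name_alt"].fillna(merged["name"]).to_numpy()
print(result)
```

Assigning through to_numpy() sidesteps index alignment, which matters here because the merge returns a fresh RangeIndex while df keeps its own labels.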

EDIT: Because only the column name needs to be replaced, the solution can be simplified with Series.map:

mapping = [str(x) for x in bad_matches]
df1 = (mapping_df.assign(alternative_ID_list=mapping_df.alternative_ID_list.str.split(', '))
                 .explode('alternative_ID_list')
                 .assign(alternative_ID_list = lambda x: x.alternative_ID_list.str.split('.').str[0])
                 .drop_duplicates('alternative_ID_list')
                 .loc[lambda x: x['alternative_ID_list'].str.extract(r'(\d+)$', expand=False).isin(mapping)]
       )
print (df1)
name       ID alternative_ID_list
7  jack  USER992               USER4
8  rick  USER902               USER5
9  john  USER979               USER6
df['name'] = df['ID'].map(df1.set_index('alternative_ID_list')['name']).fillna(df['name'])
print (df)
name     ID
0    joe  USER1
3   mary  USER2
5  USER3  USER3
8   jack  USER4
9   rick  USER5
9   john  USER6
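
For reuse, the Series.map approach could be wrapped in a small function. This is a minimal sketch, assuming the column names from the question; the helper name fill_names_from_alternatives and the sample-data constructors are hypothetical and not from the original answer.

```python
import pandas as pd


def fill_names_from_alternatives(df, mapping_df, bad_matches):
    """Hypothetical helper: look up names for IDs listed in alternative_ID_list."""
    wanted = [str(x) for x in bad_matches]
    lookup = (
        mapping_df.assign(alt_id=mapping_df["alternative_ID_list"].str.split(", "))
        .explode("alt_id")
        .reset_index(drop=True)
        # drop the ".NN" suffix, e.g. "USER4.2" -> "USER4"
        .assign(alt_id=lambda x: x["alt_id"].str.split(".").str[0])
        # first occurrence wins when an alternative ID appears in several rows
        .drop_duplicates("alt_id")
        # keep only alternative IDs whose trailing number is in bad_matches
        .loc[lambda x: x["alt_id"].str.extract(r"(\d+)$", expand=False).isin(wanted)]
        .set_index("alt_id")["name"]
    )
    out = df.copy()
    # IDs missing from the lookup map to NaN and keep their current name
    out["name"] = out["ID"].map(lookup).fillna(out["name"])
    return out


# Usage with the sample data from the question.
df = pd.DataFrame(
    {"name": ["joe", "mary", "USER3", "USER4", "USER5", "USER6"],
     "ID": ["USER1", "USER2", "USER3", "USER4", "USER5", "USER6"]},
    index=[0, 3, 5, 8, 9, 9],
)
mapping_df = pd.DataFrame(
    {"name": ["joe", "mary", "sam", "jack", "rick", "john", "jay"],
     "ID": ["USER1", "USER2", "USER98", "USER992", "USER902", "USER979", "USER980"],
     "alternative_ID_list": ["USER213.32", "USER643.11", "USER31.5", "USER4.2",
                             "USER5.6, USER321.1", "USER6.8, USER987.9",
                             "USER479.2, USER989.0"]},
    index=[0, 3, 5, 7, 8, 9, 10],
)
fixed = fill_names_from_alternatives(df, mapping_df, bad_matches=[3, 4, 5, 6])
print(fixed)
```

The function returns a copy, so the input frame is left unchanged. USER3 keeps its placeholder name because the only nearby alternative ID, USER31, does not match exactly.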
