Given data set 1:
name,x,y
st. peter,1,2
big university portland,3,4
and data set 2:
name,x,y
saint peter,3,4
uni portland,5,6
The goal is to merge them:
d1.merge(d2, on="name", how="left")
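However, the names do not match exactly, so I am looking for some kind of fuzzy matching. The specific technique matters less here than how to integrate it efficiently into pandas.
For example, st. peter might match saint peter in the other data set, but big university portland may deviate too much for us to match it with uni portland.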
One way to think about it is to allow a join on the lowest Levenshtein distance, but only if it is below 5 edits (st. --> saint is 4).
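To make the edit count concrete, here is a minimal pure-Python sketch of the Levenshtein distance. The function name levenshtein is made up for this illustration; in practice a library such as python-Levenshtein would be faster.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]

print(levenshtein("st. peter", "saint peter"))                  # 4
print(levenshtein("big university portland", "uni portland"))  # 11
```

With a cutoff of "below 5 edits", the first pair would be joined and the second would not.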
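The resulting dataframe should contain only the row st. peter, with both "name" variants as well as the x and y variables.
Is there a way to do this kind of merge with pandas?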
Have you looked at fuzzywuzzy?
You could do something like this:
import pandas as pd
import fuzzywuzzy.process as fwp

choices = list(df2['name'])

def fmatch(row):
    minscore = 95  # or whatever score works for you
    # row['name'] is the column value; row.name would be the row's index label
    choice, score = fwp.extractOne(row['name'], choices)
    return choice if score > minscore else None

df1['df2_name'] = df1.apply(fmatch, axis=1)
merged = pd.merge(df1,
                  df2,
                  left_on='df2_name',
                  right_on='name',
                  suffixes=['_df1', '_df2'],
                  how='outer')  # assuming you want to keep unmatched records
Caveat emptor: I haven't tried running this.
Suppose you have a function that returns the best match if there is one, and None otherwise:
def best_match(s, candidates):
    ''' Return the item in candidates that best matches s.
    Will return None if a good enough match is not found.
    '''
    # Some code here.
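You can then join on the value it returns, but there are different ways to do it, and they lead to different output (or so I think; I haven't looked into this much):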
(df1.assign(name=df1['name'].apply(lambda x: best_match(x, df2['name'])))
    .merge(df2, on='name', how='left'))

(df1.merge(df2.assign(name=df2['name'].apply(lambda x: best_match(x, df1['name']))),
           on='name', how='left'))
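One possible way to fill in best_match, using only the standard library: difflib.get_close_matches ranks candidates by a similarity ratio between 0 and 1. Note that this ratio is not a Levenshtein distance, and the cutoff parameter with its default of 0.75 is an assumption picked for the two sample data sets, not a recommended value.

```python
from difflib import get_close_matches

def best_match(s, candidates, cutoff=0.75):
    """Return the item in candidates that best matches s.

    Returns None if no candidate reaches the similarity cutoff.
    """
    # list() lets callers pass a pandas Series as well as a plain list.
    matches = get_close_matches(s, list(candidates), n=1, cutoff=cutoff)
    return matches[0] if matches else None

names2 = ["saint peter", "uni portland"]
print(best_match("st. peter", names2))                # saint peter
print(best_match("big university portland", names2))  # None
```

With the difflib default cutoff of 0.6, big university portland would be paired with uni portland, so the threshold needs tuning against real data.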
The simplest idea I can come up with right now is to build a special dataframe holding the distances between all pairs of names:
>>> import pandas as pd
>>> from Levenshtein import distance
>>> df1['dummy'] = 1
>>> df2['dummy'] = 1
>>> merger = pd.merge(df1, df2, on=['dummy'], suffixes=['1','2'])[['name1','name2', 'x2', 'y2']]
>>> merger
                     name1         name2  x2  y2
0                st. peter   saint peter   3   4
1                st. peter  uni portland   5   6
2  big university portland   saint peter   3   4
3  big university portland  uni portland   5   6
>>> merger['res'] = merger.apply(lambda x: distance(x['name1'], x['name2']), axis=1)
>>> merger
                     name1         name2  x2  y2  res
0                st. peter   saint peter   3   4    4
1                st. peter  uni portland   5   6    9
2  big university portland   saint peter   3   4   18
3  big university portland  uni portland   5   6   11
>>> merger = merger[merger['res'] <= 5]
>>> merger
       name1        name2  x2  y2  res
0  st. peter  saint peter   3   4    4
>>> del df1['dummy']
>>> del merger['res']
>>> pd.merge(df1, merger, how='left', left_on='name', right_on='name1')
                      name  x  y      name1        name2   x2   y2
0                st. peter  1  2  st. peter  saint peter    3    4
1  big university portland  3  4        NaN          NaN  NaN  NaN
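On pandas 1.2 or later, the dummy column can probably be dropped in favour of how="cross". A rough sketch of the same idea under a few assumptions: the levenshtein helper is a hypothetical pure-Python stand-in for Levenshtein.distance, MAX_EDITS follows the "below 5 edits" rule from the question, and keeping only the closest candidate per name is one reading of "lowest distance".

```python
import pandas as pd

def levenshtein(a, b):
    # Stand-in for Levenshtein.distance so the sketch has no extra dependency.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

MAX_EDITS = 5  # assumption: matches must be strictly below this many edits

df1 = pd.DataFrame({"name": ["st. peter", "big university portland"],
                    "x": [1, 3], "y": [2, 4]})
df2 = pd.DataFrame({"name": ["saint peter", "uni portland"],
                    "x": [3, 5], "y": [4, 6]})

# Every row of df1 paired with every row of df2; overlapping columns get suffixes.
pairs = df1.merge(df2, how="cross", suffixes=("1", "2"))
pairs["res"] = [levenshtein(a, b) for a, b in zip(pairs["name1"], pairs["name2"])]

# For each left-hand name keep the closest candidate, then apply the threshold.
closest = pairs.loc[pairs.groupby("name1")["res"].idxmin()]
result = closest[closest["res"] < MAX_EDITS].reset_index(drop=True)
print(result)
```

The cross join still builds len(df1) * len(df2) rows, so this only suits data sets small enough for that product to fit in memory.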