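****** Edited with the solution below *******
I have read the guidelines carefully and hope this question is acceptable.
I have two pandas DataFrames. I need to apply a fuzzy matching function to a target column and a reference column, and merge the data based on the similarity score while keeping the original data.
I have looked at similar questions, for example:
"Is it possible to do fuzzy match merge with python pandas?", but I could not get that solution to work. So far I have: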
import distance
import pandas as pd

df1 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'],'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})
fulldf = []
for name1 in df1['RefName']:
    for name2 in df2['TarName']:
        # Compute the distance once and keep only sufficiently close pairs
        score = distance.jaccard(name1, name2)
        if score < 0.6:
            fulldf.append({'RefName': name1, 'Score': score, 'TarName': name2})
pd_fulldf = pd.DataFrame(fulldf)
How can I include "NameId" and "Type" (and eventually other columns) in the final output, for example:
df1_NameId  RefName        df1_Type  df1_NewColumn  Score   df2_NameId  TarName         df2_Type  df2_NewColumn
1           robert johnes  Person    …              0.0000  1           roberto johnes  Person    …
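Is there a way to code this so that it is easy to extend and can run on datasets with hundreds of thousands of rows?
I have solved the original problem by unpacking the rows of each DataFrame inside the loop: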
import distance
import pandas as pd
#Create test Dataframes
df1 = pd.DataFrame({'NameId': [1,2,3], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1,2,3], 'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})
results=[]
#Iterate over each DataFrame one row at a time with itertuples()
#Each row unpacks to (index, NameId, name); pick the elements you want in the final output
#Append one result tuple per pair of rows to the empty list created above
for a, b, c in df1.itertuples():
    for d, e, f in df2.itertuples():
        results.append((a, b, c, distance.jaccard(c, f), e, d, f))
result_df=pd.DataFrame(results)
print(result_df)
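I believe what you need is the Cartesian product of TarName and RefName. Applying the distance function to that product gives the result you are looking for.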
df1["mergekey"] = 0
df2["mergekey"] = 0
df_merged = pd.merge(df1, df2, on = "mergekey")
df_merged["Distance"] = df_merged.apply(lambda x: distance.jaccard(x.RefName, x.TarName), axis = 1)
Result:
   NameId_x                  RefName  Type_x  mergekey  NameId_y                 TarName  Type_y  Distance
0         1            robert johnes  Person         0         1          roberto johnes  Person  0.000000
1         1            robert johnes  Person         0         2           lew malinosky  Person  0.705882
2         1            robert johnes  Person         0         3  andreatta della blatta  Person  0.538462
3         2             lew malinsky  Person         0         1          roberto johnes  Person  0.764706
4         2             lew malinsky  Person         0         2           lew malinosky  Person  0.083333
5         2             lew malinsky  Person         0         3  andreatta della blatta  Person  0.666667
6         3  gioberto delle lanterne  Person         0         1          roberto johnes  Person  0.533333
7         3  gioberto delle lanterne  Person         0         2           lew malinosky  Person  0.588235
8         3  gioberto delle lanterne  Person         0         3  andreatta della blatta  Person  0.250000
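The full Cartesian product grows as len(df1) * len(df2), so for hundreds of thousands of rows it may not fit in memory. One possible way around that, sketched below, is to cross join df1 against df2 one chunk at a time and keep only the pairs under the threshold. Several things here are assumptions and not part of the original posts: the helper names char_jaccard and fuzzy_cross_merge are hypothetical, the chunk size is arbitrary, how="cross" needs pandas 1.2 or later, and char_jaccard is a stand-in that compares the sets of characters of the two strings (which reproduces the distances in the table above, including 0.0 for 'robert johnes' vs 'roberto johnes') so that the example runs without the distance package.

```python
import pandas as pd


def char_jaccard(a: str, b: str) -> float:
    """Jaccard distance between the character sets of two strings (stand-in helper)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1 - len(set_a & set_b) / len(union)


def fuzzy_cross_merge(left, right, left_on, right_on, threshold, chunk_size=1000):
    """Cross join `left` with `right` chunk by chunk, keeping pairs scoring below `threshold`.

    Hypothetical helper; assumes `left` has at least one row.
    """
    kept = []
    for start in range(0, len(left), chunk_size):
        chunk = left.iloc[start:start + chunk_size]
        # Columns present in both frames (NameId, Type) get the suffixes below
        pairs = chunk.merge(right, how="cross", suffixes=("_df1", "_df2"))
        pairs["Score"] = [
            char_jaccard(a, b) for a, b in zip(pairs[left_on], pairs[right_on])
        ]
        # Discard distant pairs right away so only matches accumulate in memory
        kept.append(pairs[pairs["Score"] < threshold])
    return pd.concat(kept, ignore_index=True)


df1 = pd.DataFrame({'NameId': [1, 2, 3], 'Type': ['Person'] * 3,
                    'RefName': ['robert johnes', 'lew malinsky', 'gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1, 2, 3], 'Type': ['Person'] * 3,
                    'TarName': ['roberto johnes', 'lew malinosky', 'andreatta della blatta']})

matches = fuzzy_cross_merge(df1, df2, 'RefName', 'TarName', threshold=0.6, chunk_size=2)
print(matches)
```

Every pair is still scored, so the running time remains quadratic; chunking only bounds the memory used at any one moment. Any extra columns in df1 or df2 are carried through the merge unchanged, which keeps the original data alongside the score.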