How to apply a fuzzy matching function on the target and reference columns of pandas dataframes



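******EDITED WITH SOLUTION BELOW*******

I have read the guidelines carefully and hope this question is acceptable.

I have two pandas dataframes. I need to apply a fuzzy matching function on a target column and a reference column, then merge the data based on the similarity score while preserving the original data.

I have checked similar questions, e.g. see:
is it possible to do fuzzy match merge with python pandas? but I was not able to use that solution. So far I have: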

import distance
import pandas as pd

df1 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'],'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})

# Compare every reference name with every target name and keep the close pairs.
fulldf = []
for name1 in df1['RefName']:
    for name2 in df2['TarName']:
        score = distance.jaccard(name1, name2)
        if score < 0.6:
            fulldf.append({'RefName': name1, 'Score': score, 'TarName': name2})
pd_fulldf = pd.DataFrame(fulldf)
  1. How can I include "NameId" and "Type" (and eventually other columns) in the final output, e.g.:

    df1_NameId  RefName         df1_Type    df1_NewColumn   Score   df2_NameId  TarName         df2_Type    df2_NewColumn
       1        robert johnes   Person           …          0.0000     1        roberto johnes  Person          …
    
  2. Is there a way to code this so that it scales easily and can be run on datasets with hundreds of thousands of rows?


I have solved the initial problem by unpacking the dataframe rows in a loop:

import distance
import pandas as pd

# Create test dataframes
df1 = pd.DataFrame({'NameId': [1,2,3], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1,2,3], 'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})
results = []
# itertuples() yields one (index, NameId, name) tuple per row of each dataframe.
# Unpack every value you want in the final output inside the loops
# and append it to the empty list created above.
for a, b, c in df1.itertuples():
    for d, e, f in df2.itertuples():
        results.append((a, b, c, distance.jaccard(c, f), e, d, f))
result_df = pd.DataFrame(results)
print(result_df)
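
To make the score itself easier to interpret, here is a minimal sketch of the Jaccard distance in plain Python. It assumes that `distance.jaccard` treats each string as a set of characters, which is consistent with the 0.0000 score shown for 'robert johnes' and 'roberto johnes' in the desired output. The name `jaccard_distance` is a hypothetical stand-in, not part of the `distance` package.

```python
def jaccard_distance(a, b):
    """Jaccard distance between the character sets of two strings.

    Hypothetical stand-in for distance.jaccard, assuming it compares sets of items.
    """
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty strings: treat them as identical.
        return 0.0
    return 1 - len(set_a & set_b) / len(union)


# Both names use exactly the same set of characters, so the distance is 0
# even though the strings differ.
same_letters = jaccard_distance("robert johnes", "roberto johnes")

# The target name adds one character ('o') to an 11-character set: 1 - 11/12.
one_extra = jaccard_distance("lew malinsky", "lew malinosky")

print(same_letters)         # 0.0
print(round(one_extra, 6))  # 0.083333
```

Because the measure ignores character order and repetition, a score of 0 does not guarantee that two names are identical.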

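I believe what you need is the Cartesian product of TarName and RefName. Applying the distance function to that product gives the result you are after.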

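# A constant key in both frames makes the merge pair every row of df1 with every row of df2.
df1["mergekey"] = 0
df2["mergekey"] = 0
df_merged = pd.merge(df1, df2, on="mergekey")
# Score each pair row by row.
df_merged["Distance"] = df_merged.apply(lambda x: distance.jaccard(x.RefName, x.TarName), axis=1)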

Result:

   NameId_x  RefName                  Type_x  mergekey  NameId_y  TarName                 Type_y  Distance
0  1         robert johnes            Person  0         1         roberto johnes          Person  0.000000
1  1         robert johnes            Person  0         2         lew malinosky           Person  0.705882
2  1         robert johnes            Person  0         3         andreatta della blatta  Person  0.538462
3  2         lew malinsky             Person  0         1         roberto johnes          Person  0.764706
4  2         lew malinsky             Person  0         2         lew malinosky           Person  0.083333
5  2         lew malinsky             Person  0         3         andreatta della blatta  Person  0.666667
6  3         gioberto delle lanterne  Person  0         1         roberto johnes          Person  0.533333
7  3         gioberto delle lanterne  Person  0         2         lew malinosky           Person  0.588235
8  3         gioberto delle lanterne  Person  0         3         andreatta della blatta  Person  0.250000
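
The Cartesian-product approach can also produce the prefixed layout asked for in the question. The sketch below is one possible way to do it, not a definitive implementation. It assumes pandas 1.2 or later, where `merge` accepts `how="cross"`, and it uses a hypothetical pure-Python `jaccard_distance` helper in place of `distance.jaccard` so that it runs without the extra package. The `df1_`/`df2_` column names follow the desired output in the question.

```python
import pandas as pd


def jaccard_distance(a, b):
    """Character-set Jaccard distance (hypothetical stand-in for distance.jaccard)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return 1 - len(set_a & set_b) / len(union) if union else 0.0


df1 = pd.DataFrame({
    "NameId": [1, 2, 3],
    "Type": ["Person", "Person", "Person"],
    "RefName": ["robert johnes", "lew malinsky", "gioberto delle lanterne"],
})
df2 = pd.DataFrame({
    "NameId": [1, 2, 3],
    "Type": ["Person", "Person", "Person"],
    "TarName": ["roberto johnes", "lew malinosky", "andreatta della blatta"],
})

# Rename the shared columns so their origin stays visible after the merge.
left = df1.rename(columns={"NameId": "df1_NameId", "Type": "df1_Type"})
right = df2.rename(columns={"NameId": "df2_NameId", "Type": "df2_Type"})

# how="cross" builds the Cartesian product without a helper key column.
pairs = left.merge(right, how="cross")
pairs["Score"] = [
    jaccard_distance(ref, tar)
    for ref, tar in zip(pairs["RefName"], pairs["TarName"])
]

# Keep only the pairs under the 0.6 threshold used in the question.
matches = pairs[pairs["Score"] < 0.6].reset_index(drop=True)
print(matches)
```

Note that the product has len(df1) × len(df2) rows, so two frames of 100,000 rows each would yield on the order of 10^10 pairs. At that size the full product is unlikely to fit in memory, and some way of limiting the candidate pairs before scoring would probably be needed.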
