How do I compute the Levenshtein distance between all rows of two dataframes and output the Levenshtein score for each pair?



I am trying to compute the Levenshtein distance between two dataframes (dfa and dfb), shown below.

DFA:

Name      Addresss     ID  
Name1a    Address1a    ID1a
Name2a    Address2a    ID2a

DFB:

Name      Addresss     ID
Name1b    Address1b    ID1b
Name2b    Address2b    ID2b

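I understand how to compute the distance between two strings, but I am not sure how to compare every value in one column against every value in another column. The output should look like this, listing all pairs with their scores: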

Output:

Name      Name      LevScore
Name1a    Name1b       0.87
Name1a    Name2b       0.45
Name1a    Name3b       0.26
Name2a    Name1b       0.92
Name2a    Name2b       0.67
Name2a    Name3b       0.56
etc

Thanks in advance!

Manesh

You can use the Levenshtein package together with itertools to get every combination of values from the two columns:

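import pandas as pd
import Levenshtein as lev
from itertools import product
# df1 and df2 are the question's dfa and dfb
# Build every (Name1, Name2) pair from the two Name columns
new_df = pd.DataFrame(product(df1['Name'], df2['Name']), columns=["Name1","Name2"])
# lev.distance returns the integer edit distance for each pair
new_df["LevScore"] = new_df.apply(lambda x: lev.distance(x["Name1"], x["Name2"]), axis=1)
print(new_df)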
Name1   Name2   LevScore
0   Name1a  Name1b  1
1   Name1a  Name2b  2
2   Name2a  Name1b  2
3   Name2a  Name2b  1
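
The question's expected output shows scores between 0 and 1, while the distances above are integer edit counts. Below is a minimal sketch of one way to get such a score without any third-party package. The normalisation used here (1 minus the distance divided by the length of the longer string) is an assumption, one common choice among several, and not necessarily how the 0.87 in the question was produced. The names `levenshtein`, `similarity`, `names_a` and `names_b` are illustrative only, and the two lists stand in for dfa['Name'] and dfb['Name'].

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the two-row Wagner-Fischer dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


names_a = ["Name1a", "Name2a"]  # stands in for dfa['Name']
names_b = ["Name1b", "Name2b"]  # stands in for dfb['Name']

# One (name_a, name_b, score) row per pair, in the same order as product()
rows = [(a, b, round(similarity(a, b), 2)) for a, b in product(names_a, names_b)]
for row in rows:
    print(row)
```

If you want the result as a dataframe, the list of tuples can be passed to pd.DataFrame(rows, columns=["Name1", "Name2", "LevScore"]).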

Edit

Suppose this is your df1 (repeated three times here to simulate a larger frame):

df1_n = pd.concat([df1,df1,df1]).reset_index(drop=True)
df1_n
Name    Addresss    ID
0   Name1a  Address1a   ID1a
1   Name2a  Address2a   ID2a
2   Name1a  Address1a   ID1a
3   Name2a  Address2a   ID2a
4   Name1a  Address1a   ID1a
5   Name2a  Address2a   ID2a

As you said, you can compute the combinations from df1_n in blocks of step rows at a time:

final_df = pd.DataFrame()
step = 2
for i in range(0, df1_n.shape[0], step):
    # Pair only the current block of `step` rows from df1_n with every name in df2
    new_df = pd.DataFrame(product(df1_n.iloc[i:i+step, 0], df2['Name']), columns=["Name1","Name2"])
    new_df["LevScore"] = new_df.apply(lambda x: lev.distance(x["Name1"], x["Name2"]), axis=1)
    # Append this block's scores to the running result
    final_df = pd.concat([final_df, new_df], axis=0).reset_index(drop=True)
print(final_df)

Output:

Name1   Name2   LevScore
0   Name1a  Name1b  1
1   Name1a  Name2b  2
2   Name2a  Name1b  2
3   Name2a  Name2b  1
4   Name1a  Name1b  1
5   Name1a  Name2b  2
6   Name2a  Name1b  2
7   Name2a  Name2b  1
8   Name1a  Name1b  1
9   Name1a  Name2b  2
10  Name2a  Name1b  2
11  Name2a  Name2b  1

Change 2 to 300 or 500 depending on your case. This should avoid filling up all of your RAM. Let me know whether it works!
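
If the full set of pairs is still too large to hold at once, the same idea can be sketched with a generator that hands out a bounded number of scored pairs at a time. Note that this variant chunks by pairs rather than by rows of df1_n, so it is a different slicing from the loop above. The names `scored_chunks` and `mismatches` are hypothetical, and `mismatches` is only a stand-in scorer so that the sketch runs without third-party packages. It is not a Levenshtein distance.

```python
from itertools import islice, product


def scored_chunks(left, right, score, step):
    """Yield lists of (a, b, score(a, b)) holding at most `step` pairs each."""
    pairs = product(left, right)  # pairs are generated on demand
    while True:
        chunk = [(a, b, score(a, b)) for a, b in islice(pairs, step)]
        if not chunk:
            return  # all pairs consumed
        yield chunk


def mismatches(a, b):
    """Stand-in scorer (NOT Levenshtein): differing positions plus length gap."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


names_a = ["Name1a", "Name2a"] * 3  # stands in for df1_n['Name']
names_b = ["Name1b", "Name2b"]      # stands in for df2['Name']

# 6 x 2 = 12 pairs, handed out at most 5 at a time
chunk_sizes = [len(chunk) for chunk in scored_chunks(names_a, names_b, mismatches, step=5)]
print(chunk_sizes)
```

In real use you would pass lev.distance as the score argument and write each chunk out (for example, appending it to a file) before asking for the next one, so that only one chunk is in memory at any time.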

Try this:

import pandas as pd
from textdistance import levenshtein
from itertools import product
# dfa = pd.read_clipboard()  # this is just to reproduce your dataframe
# dfb = pd.read_clipboard()  # this is just to reproduce your dataframe
dfc = pd.DataFrame(product(dfa['Name'], dfb['Name']), columns=['Name1', 'Name2'])
dfc['Distance'] = dfc.apply(lambda x: levenshtein.distance(x['Name1'], x['Name2']), axis=1)
print(dfc)
Name1   Name2  Distance
0  Name1a  Name1b         1
1  Name1a  Name2b         2
2  Name2a  Name1b         2
3  Name2a  Name2b         1
