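I am trying to calculate the Levenshtein distance between two DataFrames (dfa and dfb), shown below.
dfa:
Name    Addresss   ID
Name1a  Address1a  ID1a
Name2a  Address2a  ID2a
dfb:
Name    Addresss   ID
Name1b  Address1b  ID1b
Name2b  Address2b  ID2b
I understand how to calculate the distance between two strings, but I am a bit confused about how to compare one column against another so that the output looks like this, showing every pair and its score:
Output: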
Name    Name    LevScore
Name1a  Name1b  0.87
Name1a  Name2b  0.45
Name1a  Name3b  0.26
Name2a  Name1b  0.92
Name2a  Name2b  0.67
Name2a  Name3b  0.56
etc.
Thanks in advance!
Manesh
You can use the Levenshtein package together with itertools to get every combination of values from the two columns:
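import pandas as pd
import Levenshtein as lev
from itertools import product
# df1 and df2 stand for the question's dfa and dfb
# Build every (df1 name, df2 name) pair
new_df = pd.DataFrame(product(df1['Name'], df2['Name']), columns=["Name1","Name2"])
# lev.distance returns the integer edit distance between two strings
new_df["LevScore"] = new_df.apply(lambda x: lev.distance(x["Name1"], x["Name2"]), axis=1)
print(new_df)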
    Name1   Name2  LevScore
0  Name1a  Name1b         1
1  Name1a  Name2b         2
2  Name2a  Name1b         2
3  Name2a  Name2b         1
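The output above holds integer edit distances, while the question shows scores between 0 and 1. One possible way to get such a score is to divide the distance by the length of the longer string and subtract from 1. The sketch below is only an illustration under that assumption: `levenshtein_distance` and `similarity` are hypothetical helper names written in plain Python, and this normalisation is not necessarily the formula the asker had in mind (library functions such as a "ratio" may use a different one).

```python
from itertools import product


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


# Stand-ins for df1['Name'] and df2['Name']
names_a = ["Name1a", "Name2a"]
names_b = ["Name1b", "Name2b"]

rows = [(a, b, round(similarity(a, b), 2)) for a, b in product(names_a, names_b)]
for row in rows:
    print(row)
```

The resulting `rows` list could be passed to `pd.DataFrame(rows, columns=["Name1", "Name2", "LevScore"])` to get the same layout as the answer above.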
Edit
Suppose this is your df1:
df1_n = pd.concat([df1,df1,df1]).reset_index(drop=True)
df1_n
     Name   Addresss    ID
0  Name1a  Address1a  ID1a
1  Name2a  Address2a  ID2a
2  Name1a  Address1a  ID1a
3  Name2a  Address2a  ID2a
4  Name1a  Address1a  ID1a
5  Name2a  Address2a  ID2a
As you suggested, you can compute the combinations for df1_n in chunks of step rows at a time:
final_df = pd.DataFrame()
step = 2
# Process df1_n in blocks of `step` rows
for i in range(0, df1_n.shape[0], step):
    new_df = pd.DataFrame(product(df1_n.iloc[i:i+step, 0], df2['Name']), columns=["Name1","Name2"])
    new_df["LevScore"] = new_df.apply(lambda x: lev.distance(x["Name1"], x["Name2"]), axis=1)
    final_df = pd.concat([final_df, new_df], axis=0).reset_index(drop=True)
print(final_df)
Output:
     Name1   Name2  LevScore
0   Name1a  Name1b         1
1   Name1a  Name2b         2
2   Name2a  Name1b         2
3   Name2a  Name2b         1
4   Name1a  Name1b         1
5   Name1a  Name2b         2
6   Name2a  Name1b         2
7   Name2a  Name2b         1
8   Name1a  Name1b         1
9   Name1a  Name2b         2
10  Name2a  Name1b         2
11  Name2a  Name2b         1
Depending on your data, change 2 to 300 or 500. This should avoid filling up your entire RAM. Let me know if it works!
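One caveat about chunking: if every chunk is kept, the final result is just as large as it would be without chunks, so the memory saving mostly comes from discarding or writing out rows as each chunk is processed. Here is a minimal sketch of that idea in plain Python. The names `edit_distance`, `pairs_in_chunks` and the `MAX_DISTANCE` threshold are hypothetical, chosen only for illustration; whether a threshold suits your data is an assumption on my part.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Plain-Python Levenshtein distance (stand-in for a library call)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def pairs_in_chunks(left, right, step):
    """Yield one list of (left, right, distance) rows per block of `step` left values."""
    for start in range(0, len(left), step):
        block = left[start:start + step]
        yield [(a, b, edit_distance(a, b)) for a, b in product(block, right)]


# Stand-ins for df1_n['Name'] and df2['Name']
names_a = ["Name1a", "Name2a"] * 3
names_b = ["Name1b", "Name2b"]

MAX_DISTANCE = 1  # hypothetical cut-off: keep only close matches

kept = []
for chunk in pairs_in_chunks(names_a, names_b, step=2):
    # Only the rows that pass the filter survive; the rest of the chunk is dropped
    kept.extend(row for row in chunk if row[2] <= MAX_DISTANCE)

print(len(kept), "pairs kept")
```

Instead of filtering, each chunk could also be appended to a file on disk, which keeps only one chunk in memory at a time.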
Try this:
import pandas as pd
from textdistance import levenshtein
from itertools import product
# dfa = pd.read_clipboard() # this is just to reproduce your dataframe
# dfb = pd.read_clipboard() # this is just to reproduce your dataframe
dfc = pd.DataFrame(product(dfa['Name'], dfb['Name']), columns=['Name1', 'Name2'])
dfc['Distance'] = dfc.apply(lambda x: levenshtein.distance(x['Name1'], x['Name2']), axis=1)
    Name1   Name2  Distance
0  Name1a  Name1b         1
1  Name1a  Name2b         2
2  Name2a  Name1b         2
3  Name2a  Name2b         1