How do I create a name-similarity column?



I have a sample DataFrame:

import numpy as np
import pandas as pd
pid = [1,2,3,4,5]; name = ['abc', 'def', 'bca', 'fed', 'pqr']; match_score = [np.nan, np.nan, np.nan, np.nan, np.nan]
sample_df = pd.DataFrame(zip(pid,name,match_score), columns=['pid', 'name', 'match_score'])
sample_df
The match_score column is all NaN at this point.

I am trying to apply fuzzywuzzy to get the desired result.

I am just starting out too, so my approach is probably not the best, but it works for the data you presented:

import pandas as pd
import numpy as np
from fuzzywuzzy import process, fuzz
pid = [1,2,3,4,5]; name = ['abc', 'def', 'bca', 'fed', 'pqr']; match_score = [np.nan, np.nan, np.nan, np.nan, np.nan]
sample_df = pd.DataFrame(zip(pid,name,match_score), columns=['pid', 'name', 'match_score'])
sample_df.drop('match_score', axis=1, inplace=True)  # dropping the column; it is recreated below
unique_names = sample_df['name'].unique().tolist()
# score every unique name against all unique names; process.extract returns
# (match, score) tuples and at most 5 of them by default (limit=5)
match_score = [(x,) + i
               for x in unique_names
               for i in process.extract(x, unique_names, scorer=fuzz.token_sort_ratio)]

similarity_df = pd.DataFrame(match_score, columns=['name','name_compare','match_score'])
# drop pairs with no similarity (score 0) and self-matches (score 100)
similarity_df = similarity_df[similarity_df['match_score'] !=0].copy()
similarity_df = similarity_df[similarity_df['match_score'] !=100].drop('name_compare', axis=1)
# the outer merge keeps names without any match; their score stays NaN
sample_df = sample_df.merge(similarity_df, left_on='name', right_on='name', how="outer")
sample_df.match_score = sample_df.match_score / 100  # rescale 0-100 scores to 0-1
print(sample_df)

Output:

   pid name  match_score
0    1  abc         0.67
1    2  def         0.33
2    3  bca         0.67
3    4  fed         0.33
4    5  pqr          NaN

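I am running two loops: an outer one and an inner one. Sorry, I could not comment the code; when I tried adding comments, I got indentation errors in Python.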

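I initialize the maximum to 0 and compare it with each computed ratio. I also check that the two strings being compared are not identical. If both checks (the string comparison and the ratio against the maximum) are true, I use loc to assign the ratio to the match_score column.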
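from difflib import SequenceMatcher

length_df = len(sample_df)
for outer_index in range(length_df):
    max_ratio = 0  # best ratio found so far for this row
    out_value = sample_df.iloc[outer_index]['name']
    for inner_index in range(length_df):
        inn_value = sample_df.iloc[inner_index]['name']
        value_ratio = SequenceMatcher(None, out_value, inn_value).ratio()
        # keep only the highest ratio among names that differ from this one
        if out_value != inn_value and value_ratio > max_ratio:
            max_ratio = value_ratio
            sample_df.loc[outer_index, 'match_score'] = value_ratio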
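
The double loop described in this answer can also be sketched as a small standalone function that uses only the standard library. This is a rough sketch, not a definitive implementation: the function name best_match_scores is made up for illustration, and it assumes that a name with no similar partner should get None (the DataFrame version leaves NaN there instead).

```python
from difflib import SequenceMatcher


def best_match_scores(names):
    """Return, for each name, its highest similarity ratio to any different name.

    best_match_scores is a hypothetical helper, not part of any library.
    A name whose ratio to every other name is 0 gets None.
    """
    scores = []
    for current in names:
        best = None  # no positive ratio found yet for this name
        for other in names:
            if other == current:
                continue  # skip identical strings, as in the loop above
            ratio = SequenceMatcher(None, current, other).ratio()
            if ratio > (best or 0):
                best = ratio
        scores.append(best)
    return scores


names = ['abc', 'def', 'bca', 'fed', 'pqr']
rounded = [None if s is None else round(s, 2) for s in best_match_scores(names)]
print(rounded)  # [0.67, 0.33, 0.67, 0.33, None]
```

If the result should go back into the DataFrame, the returned list could be assigned to the column directly, e.g. sample_df['match_score'] = best_match_scores(sample_df['name'].tolist()), in which case pandas stores the None entries as NaN.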
