Does pandas affect RapidFuzz matching results?



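I've hit a wall. RapidFuzz gives a different string similarity score depending on whether I run it on a pandas DataFrame or on the plain strings. Why does 'Address Similarity 2' differ from the result printed on the last line?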

from rapidfuzz import process, utils, fuzz
import pandas as pd
import numpy as np
address_a = 'high new technology development zones huainan city anhui province china anhui anhui any city'
address_b = 'industrial park of funan city'
test_anui_data = {'Processed Client Name': ['anhui jinhan clothing co ltd'], 'Processed Aruvio Name': ['anhui jinhan clothing co ltd'], 'Processed Client Address': [address_a], 'Processed Aruvio Address': [address_b],  'Name Similarity': [89.2857142857142],  'Address Similarity': [np.nan]}  

# Create DataFrame  
test_anui = pd.DataFrame(test_anui_data)  
test_anui
test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui['Processed Client Address']), str(test_anui['Processed Aruvio Address']))
print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))

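The discrepancy comes from passing the entire column to fuzz. Calling str() on a column gives the printed representation of the whole Series (index labels, possibly truncated values, and the Name/dtype footer), not the string stored in the cell. If you apply fuzz to a single row, as follows, you get the same result: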

test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui.at[0,'Processed Client Address']), str(test_anui.at[0,'Processed Aruvio Address']))
print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))

Or use .loc:

test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui.loc[0,'Processed Client Address']), str(test_anui.loc[0,'Processed Aruvio Address']))
print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))

The DataFrame output is:

Processed Client Name         Processed Aruvio Name  
0  anhui jinhan clothing co ltd  anhui jinhan clothing co ltd   
Processed Client Address  
0  high new technology development zones huainan ...   
Processed Aruvio Address  Name Similarity  Address Similarity  
0  industrial park of funan city        89.285714                 NaN   
Address Similarity 2  
0             28.099174  

and fuzz.token_sort_ratio(address_a, address_b) returns 28.099173553719012
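To see why the original call scored differently, it can help to inspect what str() returns for a whole column versus a single cell. This is only a minimal sketch: the one-row frame `df` and the variable names `as_column` and `as_cell` are made up for illustration, and only `address_a` comes from the question.

```python
import pandas as pd

address_a = ('high new technology development zones huainan city '
             'anhui province china anhui anhui any city')

# Hypothetical one-row frame reusing the address from the question
df = pd.DataFrame({'Processed Client Address': [address_a]})

# str() on a column returns the printed form of the whole Series
as_column = str(df['Processed Client Address'])

# str() on a single cell returns the address itself
as_cell = str(df.loc[0, 'Processed Client Address'])

print(as_column)
print(as_cell == address_a)    # the cell matches the original string
print(as_column == address_a)  # the Series text does not
```

The Series text contains the index label and a footer with the column name and dtype, so fuzz ends up comparing those extra tokens as well.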

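In other words, you need to specify which row you want to take the string from. I assume your DataFrame has several rows, which means you have to do this for each row: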

# Iterate over the index labels and write each score to its own row
for i in test_anui.index:
    test_anui.loc[i, 'Address Similarity 2'] = fuzz.token_sort_ratio(
        str(test_anui.loc[i, 'Processed Client Address']),
        str(test_anui.loc[i, 'Processed Aruvio Address']))
