Python中的模糊查找

我有两个CSV文件。一个包含供应商数据，另一个包含员工数据。类似于"；模糊查找"；在excel中，我希望进行两种类型的匹配，并输出两个csv文件中的所有列，包括一个新列作为每行的相似度。在excel中，我会使用0.80的阈值。下面是示例数据，我的实际数据在其中一个文件中有200万行，如果用excel完成，这将是一场噩梦。

输出1: 来自供应商文件，模糊匹配"供应商名称"；用"；员工姓名"；来自员工档案。显示两个文件中的所有列和相似度的新列

输出2: 来自供应商文件，模糊匹配"SSN"；用"；SSN"；来自员工档案。显示两个文件中的所有列和相似度的新列

这是两个独立的输出

数据帧1：供应商数据

供应商名称>交易金额CLIFFORD BROWN其他绿色>>>其他

公司	供应商ID	发票号	供应商类型
15	58421	854	500
150	9675	7412	70	一次性	774801971
200	15789	史密斯，约翰
200	69997	HAROON，SIMAN	964	100	741-98-7821

为了水平连接两个DataFrame，我将Employees DataFrame与匹配的供应商名称的索引对齐。如果没有匹配供应商名称，我只放一个空行。

更多详细信息：

我迭代了供应商名称，对于每个供应商名称，我将得分最高的员工名称的索引添加到索引列表中。请注意，我为每个供应商名称最多添加了一个匹配的员工记录
如果找不到匹配项(得分过低)，我会将手动添加到Employees Dataframe的空记录的索引添加到其中
然后，此索引列表用于对Employees DataDrame进行重新排序
最后，我只是将两个DataFrame水平合并。请注意，此时的两个DataFrame不必具有相同的大小，但在这种情况下，concat方法只是通过将缺失的行附加到较小的DataFrame来填补空白

代码如下：


import numpy as np
import pandas as pd
from thefuzz import process as fuzzy_process    # the new repository of fuzzywuzzy
# import dataframes
...
# adding empty row
employees_df = employees_df.append(pd.Series(dtype=np.float64), ignore_index=True)
index_of_empty = len(employees_df) - 1
# matching between vendor and employee names
indexed_employee_names_dict = dict(enumerate(employees_df["Employee Name"]))
matched_employees = set()
ordered_employees = []
scores = []
for vendor_name in vendors_df["Vendor Name"]:
match = fuzzy_process.extractOne(
query=vendor_name,
choices=indexed_employee_names_dict,
score_cutoff=80
)
score, index = match[1:] if match is not None else (0.0, index_of_empty)
matched_employees.add(index)
ordered_employees.append(index)
scores.append(score)
# detect unmatched employees to be positioned at the end of the dataframe
missing_employees = [i for i in range(len(employees_df)) if i not in matched_employees]
ordered_employees.extend(missing_employees)
ordered_employees_df = employees_df.iloc[ordered_employees].reset_index()
merged_df = pd.concat([vendors_df, ordered_employees_df], axis=1)
# adding the scores column and sorting by its values
scores.extend([0] * len(missing_employees))
merged_df["Similarity Ratio"] = pd.Series(scores) / 100
merged_df = merged_df.sort_values("Similarity Ratio", ascending=False)

对于根据SSN列的匹配，可以用完全相同的方式完成，只需替换上面代码中的列名。此外，该过程可以概括为一个接受DataFrames和列名的函数：

def match_and_merge(df1: pd.DataFrame, df2: pd.DataFrame, col1: str, col2: str, cutoff: int = 80):
# adding empty row
df2 = df2.append(pd.Series(dtype=np.float64), ignore_index=True)
index_of_empty = len(df2) - 1
# matching between vendor and employee names
indexed_strings_dict = dict(enumerate(df2[col2]))
matched_indices = set()
ordered_indices = []
scores = []
for s1 in df1[col1]:
match = fuzzy_process.extractOne(
query=s1,
choices=indexed_strings_dict,
score_cutoff=cutoff
)
score, index = match[1:] if match is not None else (0.0, index_of_empty)
matched_indices.add(index)
ordered_indices.append(index)
scores.append(score)
# detect unmatched employees to be positioned at the end of the dataframe
missing_indices = [i for i in range(len(df2)) if i not in matched_indices]
ordered_indices.extend(missing_indices)
ordered_df2 = df2.iloc[ordered_indices].reset_index()
# merge rows of dataframes
merged_df = pd.concat([df1, ordered_df2], axis=1)
# adding the scores column and sorting by its values
scores.extend([0] * len(missing_indices))
merged_df["Similarity Ratio"] = pd.Series(scores) / 100
return merged_df.sort_values("Similarity Ratio", ascending=False)

if __name__ == "__main__":
vendors_df = pd.read_excel(r'DirectorySample Vendor Data.xlsx')
employees_df = pd.read_excel(r'DirectorySample Workday Data.xlsx')

merged_df = match_and_merge(vendors_df, employees_df, "Vendor Name", "Employee Name")
merged_df.to_excel("merged_by_names.xlsx", index=False)

merged_df = match_and_merge(vendors_df, employees_df, "SSN", "SSN")
merged_df.to_excel("merged_by_ssn.xlsx", index=False)

上面的代码产生了以下输出：

merged_by_names.xlsx

供应商名称索引员工姓名员工ID>经理SN20015789MITH，JOHN>td style="text-align:right；">8040员工>9652148722SMITH，JOHN 3经理31558421克利福德布朗54>td style="text-align:right；">500>其他668419628棕色，CLIFFORD 1经理1200HAROON，SIMON 4管理器4150nannannan蓝色，CITY 2经理2

相关内容

最新更新

热门标签：