我有两个CSV文件。一个包含供应商数据,另一个包含员工数据。类似于";模糊查找";在excel中,我希望进行两种类型的匹配,并输出两个csv文件中的所有列,包括一个新列作为每行的相似度。在excel中,我会使用0.80的阈值。下面是示例数据,我的实际数据在其中一个文件中有200万行,如果用excel完成,这将是一场噩梦。
输出1: 来自供应商文件,模糊匹配"供应商名称";用";员工姓名";来自员工档案。显示两个文件中的所有列和相似度的新列
输出2: 来自供应商文件,模糊匹配"SSN";用";SSN";来自员工档案。显示两个文件中的所有列和相似度的新列
这是两个独立的输出
数据帧1:供应商数据
公司 |
| 供应商名称发票号 | >交易金额供应商类型 | ||
---|---|---|---|---|---|
15 | 58421 | CLIFFORD BROWN854 | 500 | 其他||
150 | 9675 | 绿色7412 | >70 | >一次性 | 774801971 |
200 | 15789 | 史密斯,约翰 | |||
200 | 69997 | >HAROON,SIMAN | 964 | 100 | 其他741-98-7821 |
为了水平连接两个DataFrame,我将Employees DataFrame与匹配的供应商名称的索引对齐。如果没有匹配供应商名称,我只放一个空行。
更多详细信息:
- 我迭代了供应商名称,对于每个供应商名称,我将得分最高的员工名称的索引添加到索引列表中。请注意,我为每个供应商名称最多添加了一个匹配的员工记录
- 如果找不到匹配项(得分过低),我会将手动添加到Employees Dataframe的空记录的索引添加到其中
- 然后,此索引列表用于对Employees DataDrame进行重新排序
- 最后,我只是将两个DataFrame水平合并。请注意,此时的两个DataFrame不必具有相同的大小,但在这种情况下,
concat
方法只是通过将缺失的行附加到较小的DataFrame来填补空白
代码如下:
import numpy as np
import pandas as pd
from thefuzz import process as fuzzy_process # the new repository of fuzzywuzzy
# import dataframes
...
# adding empty row
employees_df = employees_df.append(pd.Series(dtype=np.float64), ignore_index=True)
index_of_empty = len(employees_df) - 1
# matching between vendor and employee names
indexed_employee_names_dict = dict(enumerate(employees_df["Employee Name"]))
matched_employees = set()
ordered_employees = []
scores = []
for vendor_name in vendors_df["Vendor Name"]:
match = fuzzy_process.extractOne(
query=vendor_name,
choices=indexed_employee_names_dict,
score_cutoff=80
)
score, index = match[1:] if match is not None else (0.0, index_of_empty)
matched_employees.add(index)
ordered_employees.append(index)
scores.append(score)
# detect unmatched employees to be positioned at the end of the dataframe
missing_employees = [i for i in range(len(employees_df)) if i not in matched_employees]
ordered_employees.extend(missing_employees)
ordered_employees_df = employees_df.iloc[ordered_employees].reset_index()
merged_df = pd.concat([vendors_df, ordered_employees_df], axis=1)
# adding the scores column and sorting by its values
scores.extend([0] * len(missing_employees))
merged_df["Similarity Ratio"] = pd.Series(scores) / 100
merged_df = merged_df.sort_values("Similarity Ratio", ascending=False)
对于根据SSN列的匹配,可以用完全相同的方式完成,只需替换上面代码中的列名。此外,该过程可以概括为一个接受DataFrames和列名的函数:
def match_and_merge(df1: pd.DataFrame, df2: pd.DataFrame, col1: str, col2: str, cutoff: int = 80):
# adding empty row
df2 = df2.append(pd.Series(dtype=np.float64), ignore_index=True)
index_of_empty = len(df2) - 1
# matching between vendor and employee names
indexed_strings_dict = dict(enumerate(df2[col2]))
matched_indices = set()
ordered_indices = []
scores = []
for s1 in df1[col1]:
match = fuzzy_process.extractOne(
query=s1,
choices=indexed_strings_dict,
score_cutoff=cutoff
)
score, index = match[1:] if match is not None else (0.0, index_of_empty)
matched_indices.add(index)
ordered_indices.append(index)
scores.append(score)
# detect unmatched employees to be positioned at the end of the dataframe
missing_indices = [i for i in range(len(df2)) if i not in matched_indices]
ordered_indices.extend(missing_indices)
ordered_df2 = df2.iloc[ordered_indices].reset_index()
# merge rows of dataframes
merged_df = pd.concat([df1, ordered_df2], axis=1)
# adding the scores column and sorting by its values
scores.extend([0] * len(missing_indices))
merged_df["Similarity Ratio"] = pd.Series(scores) / 100
return merged_df.sort_values("Similarity Ratio", ascending=False)
if __name__ == "__main__":
vendors_df = pd.read_excel(r'DirectorySample Vendor Data.xlsx')
employees_df = pd.read_excel(r'DirectorySample Workday Data.xlsx')
merged_df = match_and_merge(vendors_df, employees_df, "Vendor Name", "Employee Name")
merged_df.to_excel("merged_by_names.xlsx", index=False)
merged_df = match_and_merge(vendors_df, employees_df, "SSN", "SSN")
merged_df.to_excel("merged_by_ssn.xlsx", index=False)
上面的代码产生了以下输出:
merged_by_names.xlsx
供应商名称