Multiprocessing for fuzzy matching in pandas



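I have two dataframes: Df_Address, with 347k distinct addresses, and Df_Project, with 24k records containing Project_Id, Project_Start_Date and Project_Address.

I want to check whether each Project_Address has a fuzzy match in Df_Address. If it does, I want to extract the corresponding Project_Id and Project_Start_Date. Below is the code I am trying: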

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Df_Address = pd.read_csv("Cantractor_Addresses.csv")
Df_Project = pd.read_csv("Project_info.csv")
#address = list(Df_Project["Project_Address"])

def fuzzy_match(x, choices, cutoff):
    print(x)
    return process.extractOne(
        x, choices=choices, score_cutoff=cutoff
    )

Matched = Df_Address["Address"].apply(
    fuzzy_match,
    args=(
        Df_Project["Project_Address"],
        80
    )
)

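This code does give output in the form of a tuple:

('matched_string', score)

It also returns the similar string, but I additionally need to extract Project_Id and Project_Start_Date. Can someone help me achieve this using parallel processing, since the data is large?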
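One possible way to approach the parallel part of the question is to split the addresses into chunks and match each chunk in its own process. The sketch below is only an illustration under assumptions: `similarity` is a standard-library stand-in (`difflib.SequenceMatcher`) rather than fuzzywuzzy's scorer, and the helper names `best_match`, `match_chunk` and `parallel_match` are hypothetical, not part of any library.

```python
from concurrent.futures import ProcessPoolExecutor
from difflib import SequenceMatcher
from functools import partial


def similarity(a, b):
    """Score two strings from 0 to 100 (stdlib stand-in, not fuzzywuzzy's scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match(query, choices, cutoff):
    """Return (query, best_choice, score), or (query, None, 0) if nothing reaches cutoff."""
    best, best_score = None, 0
    for choice in choices:
        score = similarity(query, choice)
        if score >= cutoff and score > best_score:
            best, best_score = choice, score
    return query, best, best_score


def match_chunk(queries, choices, cutoff):
    """Match every query in one chunk; this is the unit of work sent to a worker."""
    return [best_match(q, choices, cutoff) for q in queries]


def parallel_match(queries, choices, cutoff=80, workers=4):
    """Split queries into roughly equal chunks and match them in separate processes."""
    size = max(1, -(-len(queries) // workers))  # ceiling division
    chunks = [queries[i:i + size] for i in range(0, len(queries), size)]
    worker = partial(match_chunk, choices=choices, cutoff=cutoff)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(worker, chunks)  # results come back in chunk order
        return [row for chunk in results for row in chunk]
```

With the frames from the question, this could be called as `parallel_match(Df_Address["Address"].tolist(), Df_Project["Project_Address"].tolist())`, ideally under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Every worker still compares its chunk against all 24k project addresses, so the total work is unchanged; it is only spread across cores.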

You can convert the tuples to a dataframe and then join it to the base dataframe.

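import pandas as pd

Df_Address = pd.DataFrame({'address': ['abc','cdf'],'random_stuff':[100,200]})
# Example (match, score) tuples standing in for the fuzzy-matching results
Matched = (('abc',10),('cdf',20))
# Build the dataframe from the tuples (the original snippet referenced an undefined x)
dist = pd.DataFrame(list(Matched))
dist.columns = ['address','distance']
final = Df_Address.merge(dist,how='left',on='address')
print(final)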

Output:

  address  random_stuff  distance
0     abc           100        10
1     cdf           200        20
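The same join idea can be pushed one step further to bring back Project_Id and Project_Start_Date, which the question asks for. The following is a minimal sketch, assuming a standard-library scorer (`difflib.SequenceMatcher`) in place of fuzzywuzzy and small made-up frames in place of the CSV files; the `best_project_address` helper and the `Score` column are hypothetical names used only for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical sample data; the real frames come from the CSV files in the question.
Df_Address = pd.DataFrame({"Address": ["12 Main Street", "99 Unknown Road"]})
Df_Project = pd.DataFrame({
    "Project_Id": [1, 2],
    "Project_Start_Date": ["2020-01-15", "2021-06-01"],
    "Project_Address": ["12 Main Street", "45 Oak Avenue"],
})


def similarity(a, b):
    """Stand-in scorer from the standard library, scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_project_address(address, choices, cutoff=80):
    """Return (best matching choice, score), or (None, 0) below the cutoff."""
    scored = [(choice, similarity(address, choice)) for choice in choices]
    choice, score = max(scored, key=lambda pair: pair[1], default=(None, 0))
    return (choice, score) if score >= cutoff else (None, 0)


choices = Df_Project["Project_Address"].tolist()
matches = [best_project_address(a, choices) for a in Df_Address["Address"]]

# One row per address: the matched project address (or None) and its score.
Matched = pd.DataFrame(
    matches, columns=["Project_Address", "Score"], index=Df_Address.index
)

# Joining on the matched address pulls in the project columns; unmatched rows get NaN.
result = Df_Address.join(Matched).merge(Df_Project, how="left", on="Project_Address")
print(result)
```

Merging on the matched Project_Address is what carries Project_Id and Project_Start_Date across. If the same Project_Address appears in several project rows, the left merge will return one row per project, which may or may not be what is wanted.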
