我想连接panda中的两个数据帧,但由于数据不完美/不干净,可能会有不同的列提供密钥



我有两个数据集要连接,但数据很脏,没有一列可以作为完美的键。

df_supplier = pd.DataFrame({
"Company":["Company A", "Company B", "Company C"], 
"Supplier_Website":["www.companya.ie", "www.companyb.ie", "www.companyc.ie"], 
"Supplier_Telephones":["047 72451", "01 298 8330", "042 971 7900 087 222 2222"]})

df_installer = pd.DataFrame({
"Company":["Company A", "Company-B", "Company-C"], 
"Installer_Name":["jack", "peter", "helen"], 
"Surname":["himself", "himself", "herself"], 
"Email":["jack@companya.ie", "peter@companyb.ie", "helen@gmail.com"], 
"Installer_Telephone":["0123 456 789", "0123 789 456", "0156 234 789"], 
"Installer_Mobile":["Number1", "Number2", "087 222 2222"]})

在上面的数据帧中,"公司"列并不总是完全匹配;B公司"Company-B";。请注意,这些差异可能是任何东西,缺少字母/空格/大写字母,数据质量非常糟糕。还有两种方法可以合并/连接数据帧。

  1. 如果网站的域=电子邮件地址的域(对于公司B)
  2. 如果";安装程序移动;处于";Supplier_ Telephones";柱

最终,我想要的是以下

df_combined = pd.DataFrame({
"Supplier_Company":["Company A", "Company B", "Company C"], 
"Supplier_Website":["www.companya.ie", "www.companyb.ie", "www.companyc.ie"], 
"Supplier_Telephones":["047 72451", "01 298 8330", "042 971 7900 087 222 2222"],
"Installer_Name":["jack", "peter", "helen"], 
"Surname":["himself", "himself", "herself"], 
"Email":["jack@companya.ie", "peter@companyb.ie", "helen@gmail.com"], 
"Installer_Telephone":["0123 456 789", "0123 789 456", "0156 234 789"], 
"Installer_Mobile":["Number1", "Number2", "087 222 2222"]})

所以这是一种IF(公司匹配OR域名匹配OR手机号码isin),然后合并),但如果我能找到最好的方法,我会很危险。

我假设公司名称、域名和电话号码对于任何给定的公司都是唯一的。

我想手动添加最终需要的列,然后在df_supplier上迭代,在df_installer中进行最多三次搜索,然后当在任何地方找到匹配时,将值复制到df_combined中。

但我觉得这不是蟒蛇的感觉(只是一种感觉)。

感谢收到任何想法(或完美工作的代码)/Colm

我的建议是使用fuzzywuzzy模块进行非精确字符串匹配。然后您可以得到df_supplier['Company']df_installer['Company']中每个条目的匹配分数。这样你就会得到一个矩形矩阵。您认为可能匹配的所有列都是一样的。你会得到一组大小相同的矩阵。现在,您可以添加这些矩阵,并为每个条目找到最大匹配值。这将是你最匹配的候选人。现在是棘手的部分。之后设置了一些阈值。如果最大值超过它,你很可能有一个匹配。这很棘手,因为正确的阈值取决于;脏的";你的数据是。

您可以对一些匹配矩阵进行加权。例如,如果你的电话号码完全匹配,那么将得到的分数乘以(比如)10。这样一来,确定的匹配将覆盖任何其他候选者。但对于数字来说,非精确字符串比较是不起作用的,所以你必须对数据进行预处理(比如删除所有非数字),只寻找完全匹配的数据。

对于我们的工作,我们将使用一些辅助数据帧,仅用于我们想要的列,在域的情况下,使用已经到位的替换:

df_supplier_aux = pd.DataFrame({'supplier_index':df_supplier.index,'company':df_supplier.Company,'domain':df_supplier.Supplier_Website.str.replace('www.',''),'phones':df_supplier.Supplier_Telephones})
df_installer_aux = pd.DataFrame({'installer_index':df_installer.index,'company':df_installer.Company,'domain':df_installer.Email.str.replace('.*@',''),'phone':df_installer.Installer_Mobile})

之后,我们将建立我们的三个joiner,它们将保存索引,并帮助我们最终加入所有内容——它们是公司、域和电话joiner。

df_company_joiner = pd.merge(df_supplier_aux,df_installer_aux,on='company')[['supplier_index','installer_index']]
df_domain_joiner = pd.merge(df_supplier_aux,df_installer_aux,on='domain')[['supplier_index','installer_index']]
# for phone we need an intermediary dataframe you can use any of the following lines (second one was commented out)
temp = df_installer_aux.phone.apply(lambda phone:df_supplier_aux.phones.str.contains(phone))
#temp = df_supplier_aux.phones.apply(lambda phones:df_installer_aux.phone.apply(lambda phone:phone in phones)).T
df_phone_joiner = temp.unstack()[temp.unstack()].reset_index()[['level_0','level_1']].rename(columns={'level_0':'supplier_index','level_1':'installer_index'})

电话加入者使用的临时数据帧基本上是一个按安装者维度的供应商数据帧,其中有一堆truesfalses,告诉我们哪些供应商的电话列表中有安装者的手机。创建后,它会根据自身进行过滤以忽略假值,其余的真值将被取消堆栈。将二维的供应商按安装程序数组转换为供应商按安装器索引的列表是一种技巧。注意:需要注意的一个细节是,安装程序的小电话号码有一定的敏感性,如果安装程序的电话意外地只有0,则所有带零的电话都将匹配。

然后您将构建统一的master_joiner。但要做到这一点,你将做出决定。它有任何优先顺序吗?例如,如果安装程序已经通过公司加入者与供应商匹配,那么它是否仍然应该通过域然后通过电话加入,冒着被分配给另一个供应商的风险?这可能是可取的,也可能是不可取的,这取决于情况。

使用优先级:

# filtering domain from installers assinged via company
df_domain_joiner_filtered = df_domain_joiner[~df_domain_joiner.installer_index.isin(df_company_joiner.installer_index)]
# filtering phone from installers assinged via company and via domain
df_phone_joiner_filtered = df_phone_joiner[~df_phone_joiner.installer_index.isin(df_company_joiner.installer_index)&~df_phone_joiner.installer_index.isin(df_domain_joiner.installer_index)]
# we're building a master joiner using the filtered joiners
df_master_joiner = pd.concat([df_company_joiner,df_domain_joiner_filtered,df_phone_joiner_filtered])

不使用优先级:

df_master_joiner = pd.concat([df_company_joiner,df_domain_joiner,df_phone_joiner]).drop_duplicates()

而finnaly,最后一次加盟操作,从供应商端开始:

df_merged = df_supplier.merge(df_master_joiner,left_index=True,right_on='supplier_index').merge(df_installer,left_on='installer_index',right_index=True)

我们先是和主木匠一起加入了供应商,然后又和供应商一起加入了。

最后,您可以去掉两个辅助列:

del df_merged['supplier_index']
del df_merged['installer_index']

我在代码中添加了注释,以帮助您了解我是如何进行数据清理和匹配数据的。

正如@igrinis和@Ukraineserge所提到的,你可以尝试使用fuzzywuzzy模块来进行匹配。然而,您也可以尝试通过剥离不重要的字符并匹配重要的字符来实现这一点。对于电话号码,您只需要检查该号码是否是供应商电话号码的一部分。要进行检查,请去掉所有非数字字符。然后使用子字符串匹配进行检查。

以下是我为获得供应商和安装程序DataFrames之间匹配的数据而采取的步骤。

  1. 步骤1:设置一个空数据帧以存储最终结果

  2. 步骤2:清理供应商和安装程序DataFrames 中的数据

  3. 第三步:匹配公司名称。从中排除匹配记录安装程序数据帧

  4. 第四步:域名匹配。从安装程序中排除匹配记录DataFrame

  5. 步骤5:匹配安装程序的电话号码。排除匹配记录来自安装程序DataFrame

  6. 步骤6:匹配安装程序的手机号码。排除匹配记录来自安装程序DataFrame

  7. 步骤7:现在你已经匹配了公司名称,域名,删除安装程序电话号码和安装程序手机号码供应商和安装商DataFrames 中不需要的列

  8. 步骤8:打印最终匹配的DataFrame和Final的结果不匹配的DataFrame

完整代码如下所示:

import pandas as pd
pd.set_option('display.max_columns', None)
df_supplier = pd.DataFrame({
"Company":["Company A", "Company X", "Company Y", "Company D", "Company E"], 
"Supplier_Website":["www.companya.ie", "www.companyb.ie", "www.companyc.ie", "www.companyd.ie", "www.companye.ie"], 
"Supplier_Telephones":["047-72451", "(01).298.8330", "042 971 7900 087 222 2222", "047 12354", "(01).287.8335"]})

df_installer = pd.DataFrame({
"Company":["Company A", "Company-B", "Company-C", "Company-Z", "Company-F"], 
"Installer_Name":["jack", "peter", "helen", "susan","emily"], 
"Surname":["himself", "himself", "herself", "herself","herself"], 
"Email":["jack@companya.com", "peter@companyb.ie", "helen@gmail.com", "susan@yahoo.com","emily@herself.ie"], 
"Installer_Telephone":["0123 456 789", "0123 789 456", "0156 234 789", "047 123 54", "012 345 678"], 
"Installer_Mobile":["Number1", "Number2", "087 222 2222", "Number4","012 345 678"]})
print ('nThe original Supplier DataFrame is .... n')
print (df_supplier)
print ('nThe original Installer DataFrame is ..... n')
print (df_installer)
#Step 1: Setup an empty Dataframe to store the final results
#rename company name as per final requirement
df_supplier.rename(columns={'Company': 'Supplier_Company'}, inplace=True)
df_installer.rename(columns={'Company': 'Installer_Company'}, inplace=True)
#Create an empty final dataframe with the required columns
final_cols = ['Supplier_Company','Supplier_Website','Supplier_Telephones', 
'Installer_Name','Surname','Installer_Telephone','Installer_Mobile']
df_final = pd.DataFrame(columns=final_cols)
#Step 2: Cleanse the data in the Supplier and Installer DataFrames
#Step 2a: Cleanse Supplier DataFrame
#strip out space,-,(,),. from supplier company name
df_supplier['csname'] = df_supplier.Supplier_Company.replace(to_replace=r'( |-|(|)|.)', value='', regex=True)
#strip out space,-,(,),. and leading 0 from phone numbers
df_supplier['sphone'] = df_supplier.Supplier_Telephones.replace(to_replace=r'( |-|(|)|.)', value='', regex=True).replace(to_replace=r'(^0)', value='', regex=True)
#if you want to remove the .com or .ie from the website
#df_supplier['domain'] = df_supplier.Supplier_Website.replace(to_replace=r'^www.', value='', regex=True).replace(to_replace=r'.w+$', value='', regex=True)
df_supplier['domain'] = df_supplier.Supplier_Website.replace(to_replace=r'^www.', value='', regex=True)
#Step 2b: Cleanse Installer DataFrame
#strip out space,-,(,),. from installer company name
df_installer['csname'] = df_installer.Installer_Company.replace(to_replace=r'( |-|(|)|.)', value='', regex=True)
#strip out space,-,(,),. and leading 0 from phone numbers
df_installer['tphone'] = df_installer.Installer_Telephone.replace(to_replace=r'( |-|(|)|.)', value='', regex=True).replace(to_replace=r'(^0)', value='', regex=True)
df_installer['mphone'] = df_installer.Installer_Mobile.replace(to_replace=r'( |-|(|)|.)', value='', regex=True).replace(to_replace=r'(^0)', value='', regex=True)
#if you want to remove the .com or .ie from the email id
#df_installer['domain'] = df_installer.Email.str.split('@',n=1,expand=True)[1].replace(to_replace=r'.w+$', value='', regex=True)
df_installer['domain'] = df_installer.Email.str.split('@',n=1,expand=True)[1]
#Step 3: Match on Company Name. Exclude matches records from Installer DataFrame
df_merged_temp = pd.concat([df_supplier.set_index('csname'),
df_installer.set_index('csname')],
axis=1,join='inner').reset_index()
#exclude all records in df_installer if found in df_merged_temp
df_installer = df_installer[~(df_installer['csname'].isin(df_merged_temp['csname']))].dropna(how='all')
#Concatenate the matched records to df_final DataFrame
df_temp = df_merged_temp[final_cols]
df_final = pd.concat([df_final,df_temp])
#Step 4: Match on Domain Name. Exclude matches records from Installer DataFrame
df_merged_temp = pd.concat([df_supplier.set_index('domain'),
df_installer.set_index('domain')],
axis=1,join='inner').reset_index()
#exclude all records in df_installer if found in df_merged_temp
df_installer = df_installer[~(df_installer['domain'].isin(df_merged_temp['domain']))].dropna(how='all')
#Concatenate the matched records to df_final DataFrame
df_temp = df_merged_temp[final_cols]
df_final = pd.concat([df_final,df_temp])

#Step 5: Match on Installer Telephone Number. Exclude matches records from Installer DataFrame
#Create a list of all Installer Telephone Numbers to check against Supplier Phone Number
tphones = df_installer.tphone.unique().tolist()
#Match records where Supplier Phone Number is part of Installer Telephone Number
df_supplier['tphone'] = df_supplier['sphone'].apply(lambda x: ''.join([part for part in tphones if part in x]))
df_merged_temp = pd.concat([df_supplier.set_index('tphone'),
df_installer.set_index('tphone')],
axis=1,join='inner').reset_index()
#exclude all records in df_installer if found in df_merged_temp
df_installer = df_installer[~(df_installer['tphone'].isin(df_merged_temp['tphone']))].dropna(how='all')
#Concatenate the matched records to df_final DataFrame
df_domain = df_merged_temp[final_cols]
df_final = pd.concat([df_final,df_domain])
#Step 6: Match on Installer Mobile Number. Exclude matches records from Installer DataFrame
#Create a list of all Installer Mobile Numbers to check against Supplier Phone Number
mphones = df_installer.mphone.unique().tolist()
#Match records where Supplier Phone Number is part of Installer Mobile Number
df_supplier['mphone'] = df_supplier['sphone'].apply(lambda x: ''.join([part for part in mphones if part in x]))
df_merged_temp = pd.concat([df_supplier.set_index('mphone'),
df_installer.set_index('mphone')],
axis=1,join='inner').reset_index()
#exclude all records in df_installer if found in df_merged_temp
df_installer = df_installer[~(df_installer['mphone'].isin(df_merged_temp['mphone']))].dropna(how='all')
#Concatenate the matched records to df_final DataFrame
df_domain = df_merged_temp[final_cols]
df_final = pd.concat([df_final,df_domain])
#Step 7: Now that you have matched against Company Name, Domain Name,
#Installer Telephone Number, and Installer Mobile Number, remove unwanted columns
#from Supplier and Installer DataFrames
#Also delete df_merged_temp dataframe
del df_merged_temp
df_supplier.drop(columns=['domain','mphone','tphone','csname','sphone'],inplace=True)
df_installer.drop(columns=['domain','mphone','tphone','csname'],inplace=True)
df_installer.rename(columns={'Installer_Company':'Company'}, inplace=True)
#Step 8: Print the results of the Final matched DataFrame, and Final unmatched DataFrame
print ('nThe Final matched DataFrame is ..... n')
print (df_final)
print ('nThe Final unmatched Installer DataFrame is .... n')
print (df_installer)

其输出为:

原始供应商DataFrame是。。。。

Company Supplier_Website        Supplier_Telephones
0  Company A  www.companya.ie                  047-72451
1  Company X  www.companyb.ie              (01).298.8330
2  Company Y  www.companyc.ie  042 971 7900 087 222 2222
3  Company D  www.companyd.ie                  047 12354
4  Company E  www.companye.ie              (01).287.8335

原始安装程序数据帧是。。。。。

Company Installer_Name  Surname              Email Installer_Telephone  
0  Company A           jack  himself  jack@companya.com        0123 456 789   
1  Company-B          peter  himself  peter@companyb.ie        0123 789 456   
2  Company-C          helen  herself    helen@gmail.com        0156 234 789   
3  Company-Z          susan  herself    susan@yahoo.com          047 123 54   
4  Company-F          emily  herself   emily@herself.ie         012 345 678   
Installer_Mobile  
0          Number1  
1          Number2  
2     087 222 2222  
3          Number4  
4      012 345 678  

最终匹配的数据帧是。。。。。

Supplier_Company Supplier_Website        Supplier_Telephones Installer_Name  
0        Company A  www.companya.ie                  047-72451           jack   
0        Company X  www.companyb.ie              (01).298.8330          peter   
0        Company D  www.companyd.ie                  047 12354          susan   
0        Company Y  www.companyc.ie  042 971 7900 087 222 2222          helen   
Surname Installer_Telephone Installer_Mobile  
0  himself        0123 456 789          Number1  
0  himself        0123 789 456          Number2  
0  herself          047 123 54          Number4  
0  herself        0156 234 789     087 222 2222  

最终不匹配的安装程序数据帧是。。。。

Company Installer_Name  Surname             Email Installer_Telephone  
4  Company-F          emily  herself  emily@herself.ie         012 345 678   
Installer_Mobile  
4      012 345 678