请帮我做下面的操作。我想这是我此刻在熊猫身上需要做的最特别的事情。
Basicaly我需要合并两个数据帧,其中在df1中我有一个部分字符串(address_id(,在数据帧2中我有相同的信息,但与另一个连接(concat_address_id(。
我尝试了几种方法来合并、提取字符串、预处理字符串、检查包含部分字符串匹配的列表。但是,请不要像下面的示例那样找到一种聪明的方法来完成我需要的操作,即基于子字符串匹配的合并数据帧。
这是df1:
process sku qty address_id customer country
process1 sku1 1 address1 customer5 BR
process1 sku2 1 address2 customer5 BR
process1 sku3 1 address3 customer5 BR
process1 sku4 1 address4 customer5 BR
process1 sku5 1 address5 customer5 BR
这是df2。
concat_address_id last_login country_of_login
address1address5 15/10/2020 CN
address6address2 18/02/2020 NL
address3address5 13/05/2019 BR
address6address4 18/06/2020 NL
address5address8 13/05/2019 RU
这就是预期的结果。
预期结果:
process sku qty address_id customer country last_login country_of_login
process1 sku1 1 address1 customer5 BR 15/10/2020 CN
process1 sku2 1 address2 customer5 BR 18/02/2020 NL
process1 sku3 1 address3 customer5 BR 13/05/2019 BR
process1 sku4 1 address4 customer5 BR 18/06/2020 NL
process1 sku5 1 address5 customer5 BR 13/05/2019 RU
这应该也能正常工作
# Split concat_address_id column with reg expression
df2['address_id_1'] = 'address' + df2['concat_address_id'].str.split('address').str.get(1)
df2['address_id_2'] = 'address' + df2['concat_address_id'].str.split('address').str.get(2)
# Create empty address_id column to merge with df1
df2['address_id'] = ''
# Filter out address id missing from df1
df2.loc[~df2['address_id_1'].isin(list(df1['address_id'])),'address_id'] = df2['address_id_2']
# Set value in address_id column
df2.loc[df2['address_id_1'].isin(list(df1['address_id'])),'address_id'] = df2['address_id_1']
concat_address_id last_login country_of_login address_id_1 address_id_2 address_id
0 address1address5 15/10/2020 CN address1 address5 address1
1 address6address2 18/02/2020 NL address6 address2 address2
2 address3address5 13/05/2019 BR address3 address5 address3
3 address6address4 18/06/2020 NL address6 address4 address4
4 address5address8 13/05/2019 RU address5 address8 address5
# Merge df1 and df2
df_final = pd.merge(df1,df2[['address_id', 'last_login', 'country_of_login']],
on='address_id',how='left')
process sku address_id customer country last_login country_of_login
0 process1 sku1 address1 customer5 BR 15/10/2020 CN
1 process1 sku2 address2 customer5 BR 18/02/2020 NL
2 process1 sku3 address3 customer5 BR 13/05/2019 BR
3 process1 sku4 address4 customer5 BR 18/06/2020 NL
4 process1 sku5 address5 customer5 BR 13/05/2019 RU
基于此:如何在字符串中合并熊猫?
>>> df1
process sku address_id customer country
0 process1 sku1 address1 customer5 BR
1 process1 sku2 address2 customer5 BR
2 process1 sku3 address3 customer5 BR
3 process1 sku4 address4 customer5 BR
4 process1 sku5 address5 customer5 BR
>>> df2
concat_address_id last_login customer country_of_login
0 address1address5 15/10/2020 customer5 CN
1 address6address2 18/2/2020 customer5 NL
2 address3address5 13/05/2019 customer5 BR
3 address6address4 18/06/2020 customer5 NL
4 address5address8 13/05/2019 customer5 RU
>>> check = [(process, sku, address_id, customer, country, cust, last_login, country_li) for i, (process, sku, address_id, customer, country) in df1.iterrows() for j, (concat_addr, last_login, cust, country_li) in df2.iterrows() if address_id in concat_addr]
>>> check
[('process1', 'sku1', 'address1', 'customer5', 'BR', 'customer5', '15/10/2020', 'CN'), ('process1', 'sku2', 'address2', 'customer5', 'BR', 'customer5', '18/2/2020', 'NL'), ('process1', 'sku3', 'address3', 'customer5', 'BR', 'customer5', '13/05/2019', 'BR'), ('process1', 'sku4', 'address4', 'customer5', 'BR', 'customer5', '18/06/2020', 'NL'), ('process1', 'sku5', 'address5', 'customer5', 'BR', 'customer5', '15/10/2020', 'CN'), ('process1', 'sku5', 'address5', 'customer5', 'BR', 'customer5', '13/05/2019', 'BR'), ('process1', 'sku5', 'address5', 'customer5', 'BR', 'customer5', '13/05/2019', 'RU')]
>>> (pd.DataFrame(check, columns=["process", "sku", "address_id", "customer", "country", "customer", "last_login", "country_login"]))
process sku address_id customer country customer last_login country_login
0 process1 sku1 address1 customer5 BR customer5 15/10/2020 CN
1 process1 sku2 address2 customer5 BR customer5 18/2/2020 NL
2 process1 sku3 address3 customer5 BR customer5 13/05/2019 BR
3 process1 sku4 address4 customer5 BR customer5 18/06/2020 NL
4 process1 sku5 address5 customer5 BR customer5 15/10/2020 CN
5 process1 sku5 address5 customer5 BR customer5 13/05/2019 BR
6 process1 sku5 address5 customer5 BR customer5 13/05/2019 RU
I have redundant customer so it can be removed!
如果有帮助,请告诉我。