基于部分子字符串匹配合并数据帧



请帮我做下面的操作。我想这是我此刻在熊猫身上需要做的最特别的事情。

Basicaly我需要合并两个数据帧,其中在df1中我有一个部分字符串(address_id(,在数据帧2中我有相同的信息,但与另一个连接(concat_address_id(。

我尝试了几种方法来合并、提取字符串、预处理字符串、检查包含部分字符串匹配的列表。但是,请不要像下面的示例那样找到一种聪明的方法来完成我需要的操作,即基于子字符串匹配的合并数据帧。

这是df1:

process     sku    qty  address_id  customer    country
process1    sku1    1   address1    customer5   BR
process1    sku2    1   address2    customer5   BR
process1    sku3    1   address3    customer5   BR
process1    sku4    1   address4    customer5   BR
process1    sku5    1   address5    customer5   BR

这是df2。

concat_address_id   last_login  country_of_login
address1address5    15/10/2020  CN
address6address2    18/02/2020  NL
address3address5    13/05/2019  BR
address6address4    18/06/2020  NL
address5address8    13/05/2019  RU

这就是预期的结果。

预期结果:

process        sku  qty address_id  customer     country    last_login  country_of_login
process1    sku1    1   address1    customer5   BR  15/10/2020  CN
process1    sku2    1   address2    customer5   BR  18/02/2020  NL
process1    sku3    1   address3    customer5   BR  13/05/2019  BR
process1    sku4    1   address4    customer5   BR  18/06/2020  NL
process1    sku5    1   address5    customer5   BR  13/05/2019  RU

这应该也能正常工作

# Split concat_address_id column with reg expression
df2['address_id_1'] = 'address' + df2['concat_address_id'].str.split('address').str.get(1)
df2['address_id_2'] = 'address' + df2['concat_address_id'].str.split('address').str.get(2)
# Create empty address_id column to merge with df1
df2['address_id'] = ''
# Filter out address id missing from df1
df2.loc[~df2['address_id_1'].isin(list(df1['address_id'])),'address_id'] = df2['address_id_2']
# Set value in address_id column 
df2.loc[df2['address_id_1'].isin(list(df1['address_id'])),'address_id'] = df2['address_id_1']
concat_address_id   last_login  country_of_login    address_id_1    address_id_2    address_id
0   address1address5    15/10/2020  CN                  address1    address5    address1
1   address6address2    18/02/2020  NL                  address6    address2    address2
2   address3address5    13/05/2019  BR                  address3    address5    address3
3   address6address4    18/06/2020  NL                  address6    address4    address4
4   address5address8    13/05/2019  RU                  address5    address8    address5
# Merge df1 and df2
df_final = pd.merge(df1,df2[['address_id', 'last_login', 'country_of_login']],
on='address_id',how='left')
process     sku     address_id  customer    country last_login  country_of_login
0   process1    sku1    address1    customer5   BR      15/10/2020  CN
1   process1    sku2    address2    customer5   BR      18/02/2020  NL
2   process1    sku3    address3    customer5   BR      13/05/2019  BR
3   process1    sku4    address4    customer5   BR      18/06/2020  NL
4   process1    sku5    address5    customer5   BR      13/05/2019  RU

基于此:如何在字符串中合并熊猫?

>>> df1
process   sku address_id   customer country
0  process1  sku1   address1  customer5      BR
1  process1  sku2   address2  customer5      BR
2  process1  sku3   address3  customer5      BR
3  process1  sku4   address4  customer5      BR
4  process1  sku5   address5  customer5      BR
>>> df2
concat_address_id  last_login   customer country_of_login
0  address1address5  15/10/2020  customer5               CN
1  address6address2   18/2/2020  customer5               NL
2  address3address5  13/05/2019  customer5               BR
3  address6address4  18/06/2020  customer5               NL
4  address5address8  13/05/2019  customer5               RU
>>> check = [(process, sku, address_id, customer, country, cust, last_login, country_li) for i, (process, sku, address_id, customer, country) in df1.iterrows() for j, (concat_addr, last_login, cust, country_li) in df2.iterrows() if address_id in concat_addr]
>>> check
[('process1', 'sku1', 'address1', 'customer5', 'BR', 'customer5', '15/10/2020', 'CN'), ('process1', 'sku2', 'address2', 'customer5', 'BR', 'customer5', '18/2/2020', 'NL'), ('process1', 'sku3', 'address3', 'customer5', 'BR', 'customer5', '13/05/2019', 'BR'), ('process1', 'sku4', 'address4', 'customer5', 'BR', 'customer5', '18/06/2020', 'NL'), ('process1', 'sku5', 'address5', 'customer5', 'BR', 'customer5', '15/10/2020', 'CN'), ('process1', 'sku5', 'address5', 'customer5', 'BR', 'customer5', '13/05/2019', 'BR'), ('process1', 'sku5', 'address5', 'customer5', 'BR', 'customer5', '13/05/2019', 'RU')]

>>> (pd.DataFrame(check, columns=["process", "sku", "address_id", "customer", "country", "customer", "last_login", "country_login"]))
process   sku address_id   customer country   customer  last_login country_login
0  process1  sku1   address1  customer5      BR  customer5  15/10/2020            CN
1  process1  sku2   address2  customer5      BR  customer5   18/2/2020            NL
2  process1  sku3   address3  customer5      BR  customer5  13/05/2019            BR
3  process1  sku4   address4  customer5      BR  customer5  18/06/2020            NL
4  process1  sku5   address5  customer5      BR  customer5  15/10/2020            CN
5  process1  sku5   address5  customer5      BR  customer5  13/05/2019            BR
6  process1  sku5   address5  customer5      BR  customer5  13/05/2019            RU

I have redundant customer so it can be removed!

如果有帮助,请告诉我。

相关内容

  • 没有找到相关文章