基于键值panda拆分数据帧列



我有一个这样的数据帧:

Id Column  Val1   Val2
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0
2                  Cust=acc hit,Data=125   3.0  400.0
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0
5               Cust=evc,Region Info=atz   2.0    NaN

我想把数据帧转换成这样:

Id Column  Val1   Val2     Cust Region Info   Data
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN      abc         xyz  123.0
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0      abd         xyz  124.0
2                  Cust=acc hit,Data=125   3.0  400.0  acc hit         NaN  125.0
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0      abc         xyz  126.0
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0  abg nss         xaz  127.0
5               Cust=evc,Region Info=atz   2.0    NaN      evc         atz    NaN

从另一个问题中,我得到了部分答案。

但是如何处理键和值中的空格呢?

编辑:可能有多个键值对(除了示例中显示的键值对(。因此,我需要处理任意"n"列数的事例。

Series.str.findall

我们可以使用带有正则表达式捕获组的str.findallId Column列中提取key-value

df.join(pd.DataFrame(map(dict, df['Id Column'].str.findall(r'([^=,]+)=([^,]+)'))))

Id Column  Val1   Val2     Cust Region Info Data
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN      abc         xyz  123
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0      abd         xyz  124
2                  Cust=acc hit,Data=125   3.0  400.0  acc hit         NaN  125
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0      abc         xyz  126
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0  abg nss         xaz  127
5               Cust=evc,Region Info=atz   2.0    NaN      evc         atz  NaN

Regex详细信息

  • ([^=,]+):第一个捕获组
    • [^=,]+:匹配列表[=,]中不存在的任何字符一次或多次
  • =:从字面上匹配=字符
  • ([^,]+):第二捕获组
    • [^,]+:匹配列表[,]中不存在的任何字符一次或多次

请参阅联机regex demo

只显示您的示例,请尝试以下操作。

import pandas as pd
df[["Cust Region","Info","Data"]] = df["IdColumn"].str.extract(r'^Cust=([^,]+)(?:,Region Info=([^,]*))?(?:,Data=(.*))?$', expand=True)
df

以下是已使用regex 的在线演示

输出如下:

IdColumn  Val1   Val2 Cust Region Info Data
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN         abc  xyz  123
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0         abd  xyz  124
2                  Cust=acc hit,Data=125   3.0  400.0     acc hit  NaN  125
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0         abc  xyz  126
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0     abg nss  xaz  127
5               Cust=evc,Region Info=atz   2.0    NaN         evc  atz  NaN

解释:添加对上述正则表达式的详细解释。

^Cust=              ##Checking if value starts from Cust= here.
([^,]+)             ##Creating 1st capturing group which has all values till , here.
(?:,Region Info=    ##Starting a non-capturing group , Region Info= here.
([^,]*)           ##Creating 2nd capturing group which has all values till , here.
)?                  ##Closing non-capturing group here.
(?:,Data=           ##Creating non-capturing group which has ,Data= here.
(.*)              ##Creating 3rd capturing group which has all values till end of value here.
)?$                 ##Closing non-capturing group here at the end of line.

在列Id Column上使用apply()并通过拆分获得值。

df['Cust Region'] = df['Id Column'].apply(lambda x: x.split(',')[0].split('=')[-1])
# print(df)
Id Column  Val1   Val2 Cust Region
0      Cust=abc,Region Info=xyz,Data=123  0.0     NaN         abc
1      Cust=abd,Region Info=xyz,Data=124  1.0   750.0         abd
2                  Cust=acc hit,Data=125  3.0   400.0     acc hit
3      Cust=abc,Region Info=xyz,Data=126  NaN   200.0         abc
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0     abg nss
5               Cust=evc,Region Info=atz  2.0     NaN         evc

使用列表理解,先按,拆分,然后按=拆分字典列表,因此可以传递给DataFrame构造函数:

L = [dict([y.split('=') for y in x.split(',')]) for x in df['Id Column']]
df = df.join(pd.DataFrame(L, index=df.index))
print (df)
Id Column  Val1   Val2     Cust Region Info  
0      Cust=abc,Region Info=xyz,Data=123   0.0    NaN      abc         xyz   
1      Cust=abd,Region Info=xyz,Data=124   1.0  750.0      abd         xyz   
2                  Cust=acc hit,Data=125   3.0  400.0  acc hit         NaN   
3      Cust=abc,Region Info=xyz,Data=126   NaN  200.0      abc         xyz   
4  Cust=abg nss,Region Info=xaz,Data=127  -1.0  420.0  abg nss         xaz   
5               Cust=evc,Region Info=atz   2.0    NaN      evc         atz   
Data  
0  123  
1  124  
2  125  
3  126  
4  127  
5  NaN  

最新更新