我有一个这样的数据帧:
Id Column Val1 Val2
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0
2 Cust=acc hit,Data=125 3.0 400.0
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0
5 Cust=evc,Region Info=atz 2.0 NaN
我想把数据帧转换成这样:
Id Column Val1 Val2 Cust Region Info Data
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN abc xyz 123.0
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0 abd xyz 124.0
2 Cust=acc hit,Data=125 3.0 400.0 acc hit NaN 125.0
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0 abc xyz 126.0
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0 abg nss xaz 127.0
5 Cust=evc,Region Info=atz 2.0 NaN evc atz NaN
从另一个问题中,我得到了部分答案。
但是如何处理键和值中的空格呢?
编辑:可能有多个键值对(除了示例中显示的键值对(。因此,我需要处理任意"n"列数的事例。
Series.str.findall
我们可以使用带有正则表达式捕获组的str.findall
从Id Column
列中提取key-value
对
df.join(pd.DataFrame(map(dict, df['Id Column'].str.findall(r'([^=,]+)=([^,]+)'))))
Id Column Val1 Val2 Cust Region Info Data
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN abc xyz 123
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0 abd xyz 124
2 Cust=acc hit,Data=125 3.0 400.0 acc hit NaN 125
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0 abc xyz 126
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0 abg nss xaz 127
5 Cust=evc,Region Info=atz 2.0 NaN evc atz NaN
Regex
详细信息
([^=,]+)
:第一个捕获组[^=,]+
:匹配列表[=,]
中不存在的任何字符一次或多次
=
:从字面上匹配=
字符([^,]+)
:第二捕获组[^,]+
:匹配列表[,]
中不存在的任何字符一次或多次
请参阅联机regex demo
只显示您的示例,请尝试以下操作。
import pandas as pd
df[["Cust Region","Info","Data"]] = df["IdColumn"].str.extract(r'^Cust=([^,]+)(?:,Region Info=([^,]*))?(?:,Data=(.*))?$', expand=True)
df
以下是已使用regex 的在线演示
输出如下:
IdColumn Val1 Val2 Cust Region Info Data
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN abc xyz 123
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0 abd xyz 124
2 Cust=acc hit,Data=125 3.0 400.0 acc hit NaN 125
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0 abc xyz 126
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0 abg nss xaz 127
5 Cust=evc,Region Info=atz 2.0 NaN evc atz NaN
解释:添加对上述正则表达式的详细解释。
^Cust= ##Checking if value starts from Cust= here.
([^,]+) ##Creating 1st capturing group which has all values till , here.
(?:,Region Info= ##Starting a non-capturing group , Region Info= here.
([^,]*) ##Creating 2nd capturing group which has all values till , here.
)? ##Closing non-capturing group here.
(?:,Data= ##Creating non-capturing group which has ,Data= here.
(.*) ##Creating 3rd capturing group which has all values till end of value here.
)?$ ##Closing non-capturing group here at the end of line.
在列Id Column
上使用apply()
并通过拆分获得值。
df['Cust Region'] = df['Id Column'].apply(lambda x: x.split(',')[0].split('=')[-1])
# print(df)
Id Column Val1 Val2 Cust Region
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN abc
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0 abd
2 Cust=acc hit,Data=125 3.0 400.0 acc hit
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0 abc
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0 abg nss
5 Cust=evc,Region Info=atz 2.0 NaN evc
使用列表理解,先按,
拆分,然后按=
拆分字典列表,因此可以传递给DataFrame
构造函数:
L = [dict([y.split('=') for y in x.split(',')]) for x in df['Id Column']]
df = df.join(pd.DataFrame(L, index=df.index))
print (df)
Id Column Val1 Val2 Cust Region Info
0 Cust=abc,Region Info=xyz,Data=123 0.0 NaN abc xyz
1 Cust=abd,Region Info=xyz,Data=124 1.0 750.0 abd xyz
2 Cust=acc hit,Data=125 3.0 400.0 acc hit NaN
3 Cust=abc,Region Info=xyz,Data=126 NaN 200.0 abc xyz
4 Cust=abg nss,Region Info=xaz,Data=127 -1.0 420.0 abg nss xaz
5 Cust=evc,Region Info=atz 2.0 NaN evc atz
Data
0 123
1 124
2 125
3 126
4 127
5 NaN