我正在尝试扩展pandas数据帧的一列(请参阅下面示例中的列段。(我能够把它分解成由分隔开的组件;但是,正如您所看到的,列中的一些行不具有所有元素。所以,现在发生的是应该进入Geo列的数据最终进入BusSeg纵队,因为没有Geo纵队;或数据应该在ProdServ列中的,最终在Geo列中。理想情况下,我希望只有数据,而不是指标在正确放置的每个单元格中。所以在地理栏中,应该写"非我们"。不是"Geo=NonUs"那是在正确分隔之后,我想删除文本每个中包含'='符号。我该怎么做?以下代码:
import pandas as pd
company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
'Subseg=Tr1',
'BusSeg=Pharma',
'Geo=China;Prd=Alpha;Subseg=Tr4;',
'Prd=Beta;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
'BusSeg=Pharma;Geo=NonUs;']
print("ndf1:")
df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True)
print(df1)
print(df1[['BusSeg','Geo','ProdServ','Sub','Misc']])
print(df1.dtypes)
print()
您的数据
import pandas as pd
company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
'Subseg=Tr1',
'BusSeg=Pharma',
'Geo=China;Prd=Alpha;Subseg=Tr4;',
'Prd=Beta;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
'BusSeg=Pharma;Geo=NonUs;']
df:
company clv date line segments
0 Rev 500 20191231 1 BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1
1 Rev 200 20191231 3 BusSeg=Dev;Prd=Alpha;Subseg=Tr1
2 Rev 3000 20191231 2 BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2
3 Rev 400 20181231 1 Subseg=Tr1
4 Rev 10 20181231 3 BusSeg=Pharma
5 Rev 300 20181231 2 Geo=China;Prd=Alpha;Subseg=Tr4;
6 Rev 560 20171231 1 Prd=Beta;Subseg=Tr1
7 Rev 500 20171231 3 BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;
8 Rev 600 20171231 2 BusSeg=Pharma;Geo=NonUs;
在代码中注释这一行df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True)
,并添加这两行
d = pd.DataFrame(df1['segments'].str.split(';').apply(lambda x:{i.split("=")[0] : i.split("=")[1] for i in x if i}).to_dict()).T
df = pd.concat([df1, d], axis=1)
df:
company clv date line segments BusSeg Geo Prd Subseg
0 Rev 500 20191231 1 BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1 Pharma NonUs Alpha Tr1
1 Rev 200 20191231 3 BusSeg=Dev;Prd=Alpha;Subseg=Tr1 Dev NaN Alpha Tr1
2 Rev 3000 20191231 2 BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2 Pharma US Alpha Tr2
3 Rev 400 20181231 1 Subseg=Tr1 NaN NaN NaN Tr1
4 Rev 10 20181231 3 BusSeg=Pharma Pharma NaN NaN NaN
5 Rev 300 20181231 2 Geo=China;Prd=Alpha;Subseg=Tr4; NaN China Alpha Tr4
6 Rev 560 20171231 1 Prd=Beta;Subseg=Tr1 NaN NaN Beta Tr1
7 Rev 500 20171231 3 BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1; Pharma US Delta Tr1
8 Rev 600 20171231 2 BusSeg=Pharma;Geo=NonUs; Pharma NonUs NaN NaN
我建议,要逐个填充列,而不是使用拆分,代码如下:
col = ['BusSeg', 'Geo', 'ProdServ', 'Sub'] # Columns Names.
var = ['BusSeg', 'Geo', 'Prd', 'Subseg'] # Variables Name in 'Subseg' column.
for c, v in zip(col, var):
df1[c] = df1['segments'].str.extract(r'' + v + '=(w*);')
这里有一个建议:
df1.segments = (df1.segments.str.split(';')
.apply(lambda s:
dict(t.split('=') for t in s if t.strip() != '')))
df2 = pd.DataFrame({col: [dict_.get(col, '') for dict_ in df1.segments]
for col in set().union(*df1.segments)},
index=df1.index)
df1.drop(columns=['segments'], inplace=True)
df1 = pd.concat([df1, df2], axis='columns')
结果:
company clv date line Subseg Geo BusSeg Prd
0 Rev 500 20191231 1 Tr1 NonUs Pharma Alpha
1 Rev 200 20191231 3 Tr1 Dev Alpha
2 Rev 3000 20191231 2 Tr2 US Pharma Alpha
3 Rev 400 20181231 1 Tr1
4 Rev 10 20181231 3 Pharma
5 Rev 300 20181231 2 Tr4 China Alpha
6 Rev 560 20171231 1 Tr1 Beta
7 Rev 500 20171231 3 Tr1 US Pharma Delta
8 Rev 600 20171231 2 NonUs Pharma