Reorganizing a pandas DataFrame: turning a column into the new header, and making the original headers part of a MultiIndex with an existing column



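I have been tasked with reorganizing a fairly large dataset for analysis. I want to build a DataFrame in which each employee has a list of statistics tied to their employee number, ordered by how long they have been with the company. The data does not go back to the company's founding, so some employees will not appear in the first period. My guess is that some combination of pivot and merge would do it, but I can't work out which.

df1 looks like this: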

   Periods since Start  Period  Employee Number  Wage  Sick Days
0                    3  202001              101    20         14
1                    2  202001              102    15         12
2                    1  202001              103    10         17
3                    4  202002              101    20         14
4                    3  202002              102    20         10
5                    2  202002              103    10         13
6                    5  202003              101    25         13
7                    4  202003              102    20          9
8                    3  202003              103    10         13
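To make the example reproducible, df1 can be built roughly like this. It is only a sketch: the values are copied from the table above, and the integer dtypes are an assumption.

```python
import pandas as pd

# Sample data copied from the table above (dtypes assumed to be integers).
df1 = pd.DataFrame({
    'Periods since Start': [3, 2, 1, 4, 3, 2, 5, 4, 3],
    'Period': [202001] * 3 + [202002] * 3 + [202003] * 3,
    'Employee Number': [101, 102, 103] * 3,
    'Wage': [20, 15, 10, 20, 20, 10, 25, 20, 10],
    'Sick Days': [14, 12, 17, 14, 10, 13, 13, 9, 13],
})
print(df1)
```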

I want df2 (the column numbers are for reference only):

Column1  Column2    Column3  Column4  Column5
                    101      102      103
1        Wage       NaN      NaN      10
1        Sick Days  NaN      NaN      17
2        Wage       NaN      15       10
2        Sick Days  NaN      12       13
3        Wage       20       20       10
3        Sick Days  14       10       13
4        Wage       20       20       NaN
4        Sick Days  14       9        NaN

Column1 = "Periods since Start"

Column2 = "Stat", e.g. "Wage", "Sick Days"

Column3 to Column5 headers = "Employee Number"

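My first idea was to try pivot/merge/stack, but I did not get good results.

The second option I thought of was to create a DataFrame with the index and header I want, and then fill it from df1: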

import pandas as pd
import numpy as np

stat_list = ['Wage', 'Sick Days']
largest_period = df1['Periods since Start'].max()
df2 = np.tile(stat_list, largest_period)
df2 = pd.DataFrame(data=df2, columns=['Stat'])
df2['Period_Number'] = df2.groupby('Stat').cumcount() + 1
df2 = pd.DataFrame(index=df2[['Period_Number', 'Stat']],
                   columns=df1['Employee Number'])

This yields:

Employee Number    101   102   103   
(1, 'Wage')        NaN   NaN   NaN  
(1, 'Sick Days')   NaN   NaN   NaN  
(2, 'Wage')        NaN   NaN   NaN  
(2, 'Sick Days')   NaN   NaN   NaN  
(3, 'Wage')        NaN   NaN   NaN  
(3, 'Sick Days')   NaN   NaN   NaN  
(4, 'Wage')        NaN   NaN   NaN  
(4, 'Sick Days')   NaN   NaN   NaN  

But I don't know how to fill it.

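  1. You can first .melt and then .unstack the DataFrame.
  2. Finally, use .droplevel with axis=1 to remove the unnecessary level from the columns; the default, axis=0, would drop a level from the index instead. You can also use reset_index() to move the index levels back into the DataFrame as columns: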

df = (df.melt(id_vars=['Periods since Start', 'Employee Number'],
              value_vars=['Wage', 'Sick Days'])
        .set_index(['Periods since Start', 'Employee Number', 'variable'])
        .unstack(1)
        .droplevel(0, axis=1)
        .reset_index())
df
Out[1]: 
Employee Number  Periods since Start   variable   101   102   103
0                                  1  Sick Days   NaN   NaN  17.0
1                                  1       Wage   NaN   NaN  10.0
2                                  2  Sick Days   NaN  12.0  13.0
3                                  2       Wage   NaN  15.0  10.0
4                                  3  Sick Days  14.0  10.0  13.0
5                                  3       Wage  20.0  20.0  10.0
6                                  4  Sick Days  14.0   9.0   NaN
7                                  4       Wage  20.0  20.0   NaN
8                                  5  Sick Days  13.0   NaN   NaN
9                                  5       Wage  25.0   NaN   NaN

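When melting the DataFrame, you can pass var_name= to name the new column, since the default name is "variable". If you do, make sure to change the column name in set_index() as well.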
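As a rough sketch of that variation, assuming the new column should be called Stat (the name used in the question's desired output) and using sample data copied from the question's df1:

```python
import pandas as pd

# Sample data copied from the question (Period is omitted because it is not used here).
df = pd.DataFrame({
    'Periods since Start': [3, 2, 1, 4, 3, 2, 5, 4, 3],
    'Employee Number': [101, 102, 103] * 3,
    'Wage': [20, 15, 10, 20, 20, 10, 25, 20, 10],
    'Sick Days': [14, 12, 17, 14, 10, 13, 13, 9, 13],
})

out = (df.melt(id_vars=['Periods since Start', 'Employee Number'],
               value_vars=['Wage', 'Sick Days'],
               var_name='Stat')  # 'Stat' replaces the default name 'variable'
         # The renamed column has to be used here as well.
         .set_index(['Periods since Start', 'Employee Number', 'Stat'])
         .unstack('Employee Number')  # employee numbers become columns
         .droplevel(0, axis=1)        # drop the leftover 'value' level
         .reset_index())
print(out)
```

Unstacking by level name rather than by position is a small design choice here: it keeps working if the order of the columns passed to set_index() changes.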

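Try this: first melt the DataFrame, keeping Periods since Start, Employee Number, and Period as identifier columns. Next, pivot the melted DataFrame so that Periods since Start and variable form the rows, Employee Number forms the columns, and the cells are filled from the "value" column. Finally, clean up the index with reset_index and remove the column index header name with rename_axis: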

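# Wrapped in parentheses so the method chain can span several lines.
# pivot takes keyword arguments only as of pandas 2.0.
(df.melt(['Periods since Start', 'Employee Number', 'Period'])
   .pivot(index=['Periods since Start', 'variable'],
          columns='Employee Number', values='value')
   .reset_index()
   .rename_axis(None, axis=1))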

Output:

Periods since Start   variable   101   102   103
0                    1  Sick Days   NaN   NaN  17.0
1                    1       Wage   NaN   NaN  10.0
2                    2  Sick Days   NaN  12.0  13.0
3                    2       Wage   NaN  15.0  10.0
4                    3  Sick Days  14.0  10.0  13.0
5                    3       Wage  20.0  20.0  10.0
6                    4  Sick Days  14.0   9.0   NaN
7                    4       Wage  20.0  20.0   NaN
8                    5  Sick Days  13.0   NaN   NaN
9                    5       Wage  25.0   NaN   NaN
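If you would rather keep Periods since Start and variable as a MultiIndex on the rows, which is closer to what the question's title describes, one option is to leave out reset_index. A minimal sketch, using sample data copied from the question and assuming a pandas version in which pivot accepts a list for index (1.1 or later):

```python
import pandas as pd

# Sample data copied from the question's df1.
df = pd.DataFrame({
    'Periods since Start': [3, 2, 1, 4, 3, 2, 5, 4, 3],
    'Period': [202001] * 3 + [202002] * 3 + [202003] * 3,
    'Employee Number': [101, 102, 103] * 3,
    'Wage': [20, 15, 10, 20, 20, 10, 25, 20, 10],
    'Sick Days': [14, 12, 17, 14, 10, 13, 13, 9, 13],
})

wide = (df.melt(['Periods since Start', 'Employee Number', 'Period'])
          .pivot(index=['Periods since Start', 'variable'],
                 columns='Employee Number', values='value'))

# Rows are addressed by (period, stat) tuples, columns by employee number.
print(wide.loc[(3, 'Wage'), 101])
# Selecting on the first index level returns both stats for that period.
print(wide.loc[2])
```

Keeping the MultiIndex makes lookups by period and statistic direct; call reset_index() afterwards if flat columns are easier for later processing.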
