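I have been tasked with reorganizing a fairly large dataset for analysis. I want to build a DataFrame in which each employee has a column of statistics under their employee number, ordered by how long they have been with the company. The data does not go back to the company's founding, so some employees will not appear in the first period. My guess is that some combination of pivot and merge would do it, but I can't work it out.
df1 looks like this: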
   Periods since Start  Period  Employee Number  Wage  Sick Days
0                    3  202001              101    20         14
1                    2  202001              102    15         12
2                    1  202001              103    10         17
3                    4  202002              101    20         14
4                    3  202002              102    20         10
5                    2  202002              103    10         13
6                    5  202003              101    25         13
7                    4  202003              102    20          9
8                    3  202003              103    10         13
I want df2 (the column numbers are for reference only):
Column1  Column2    Column3  Column4  Column5
                        101      102      103
1        Wage           NaN      NaN       10
1        Sick Days      NaN      NaN       17
2        Wage           NaN       15       10
2        Sick Days      NaN       12       13
3        Wage            20       20       10
3        Sick Days       14       10       13
4        Wage            20       20      NaN
4        Sick Days       14        9      NaN
Column1 = "Periods since Start"
Column2 = "Stat", e.g. "Wage", "Sick Days"
Column3 to Column5 headers = "Employee Number"
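My first idea was to try pivot/merge/stack, but I did not get good results.
The second option I thought of was to create a DataFrame with the index and headers I want, and then populate it from df1: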
import pandas as pd
import numpy as np
stat_list = ['Wage', 'Sick Days']
largest_period = df1['Periods since Start'].max()
df2 = np.tile(stat_list, largest_period)
df2 = pd.DataFrame(data=df2, columns = ['Stat'])
df2['Period_Number'] = df2.groupby('Stat').cumcount()+1
df2 = pd.DataFrame(index = df2[['Period_Number', 'Stat']],
                   columns = df1['Employee Number'])
This yields:
Employee Number 101 102 103
(1, 'Wage') NaN NaN NaN
(1, 'Sick Days') NaN NaN NaN
(2, 'Wage') NaN NaN NaN
(2, 'Sick Days') NaN NaN NaN
(3, 'Wage') NaN NaN NaN
(3, 'Sick Days') NaN NaN NaN
(4, 'Wage') NaN NaN NaN
(4, 'Sick Days') NaN NaN NaN
But I don't know how to populate it.
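- You can first .melt and then .unstack the DataFrame.
- Finally, use .droplevel with axis=1 to remove the unnecessary level on the columns, rather than the default axis=0, which would drop a level of the row index. You can also use reset_index() to bring the index columns back into the DataFrame: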
df = (df.melt(id_vars=['Periods since Start', 'Employee Number'],
              value_vars=['Wage', 'Sick Days'])
        .set_index(['Periods since Start', 'Employee Number', 'variable']).unstack(1)
        .droplevel(0, axis=1)
        .reset_index())
df
Out[1]:
Employee Number Periods since Start variable 101 102 103
0 1 Sick Days NaN NaN 17.0
1 1 Wage NaN NaN 10.0
2 2 Sick Days NaN 12.0 13.0
3 2 Wage NaN 15.0 10.0
4 3 Sick Days 14.0 10.0 13.0
5 3 Wage 20.0 20.0 10.0
6 4 Sick Days 14.0 9.0 NaN
7 4 Wage 20.0 20.0 NaN
8 5 Sick Days 13.0 NaN NaN
9 5 Wage 25.0 NaN NaN
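When melting the DataFrame, you can also pass var_name= to name the new column, since the default is "variable". If you do, make sure to use the same column name in set_index().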
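As a rough sketch of that var_name variation, the snippet below rebuilds the sample data from the question and names the melted column Stat, then uses the same name in set_index(). The name Stat and the variable out are chosen here purely for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Periods since Start': [3, 2, 1, 4, 3, 2, 5, 4, 3],
    'Period': [202001] * 3 + [202002] * 3 + [202003] * 3,
    'Employee Number': [101, 102, 103] * 3,
    'Wage': [20, 15, 10, 20, 20, 10, 25, 20, 10],
    'Sick Days': [14, 12, 17, 14, 10, 13, 13, 9, 13],
})

# 'Stat' is an illustrative name; any label works as long as
# set_index() below refers to the same one.
out = (df.melt(id_vars=['Periods since Start', 'Employee Number'],
               value_vars=['Wage', 'Sick Days'],
               var_name='Stat')
         .set_index(['Periods since Start', 'Employee Number', 'Stat'])
         .unstack(1)            # move Employee Number into the columns
         .droplevel(0, axis=1)  # drop the leftover 'value' column level
         .reset_index())
print(out)
```

The result should have the same shape as the output above, with the second column headed Stat instead of variable.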
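Try this approach. First, melt the DataFrame, keeping Periods since Start, Employee Number and Period as identifier columns. Next, pivot the melted DataFrame so that Periods since Start and variable form the rows, Employee Number forms the columns, and the cells are filled from value. Finally, use reset_index to move the index back into columns and rename_axis to remove the column index header name: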
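# Parentheses allow the chain to span lines; pivot arguments are passed by keyword.
(df.melt(['Periods since Start', 'Employee Number', 'Period'])
   .pivot(index=['Periods since Start', 'variable'],
          columns='Employee Number',
          values='value')
   .reset_index()
   .rename_axis(None, axis=1))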
Output:
Periods since Start variable 101 102 103
0 1 Sick Days NaN NaN 17.0
1 1 Wage NaN NaN 10.0
2 2 Sick Days NaN 12.0 13.0
3 2 Wage NaN 15.0 10.0
4 3 Sick Days 14.0 10.0 13.0
5 3 Wage 20.0 20.0 10.0
6 4 Sick Days 14.0 9.0 NaN
7 4 Wage 20.0 20.0 NaN
8 5 Sick Days 13.0 NaN NaN
9 5 Wage 25.0 NaN NaN
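The result above lists Sick Days before Wage within each period, whereas the layout requested in the question puts Wage first. If that row order matters, one possible approach is to make the melted variable column an ordered categorical before pivoting. This is only a sketch and not part of either answer's method; the sample data is rebuilt from the question, and stat_order, long and wide are names made up for this example.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Periods since Start': [3, 2, 1, 4, 3, 2, 5, 4, 3],
    'Period': [202001] * 3 + [202002] * 3 + [202003] * 3,
    'Employee Number': [101, 102, 103] * 3,
    'Wage': [20, 15, 10, 20, 20, 10, 25, 20, 10],
    'Sick Days': [14, 12, 17, 14, 10, 13, 13, 9, 13],
})

# Desired order of the statistics within each period (assumption:
# taken from the df2 layout shown in the question).
stat_order = ['Wage', 'Sick Days']

long = df.melt(id_vars=['Periods since Start', 'Employee Number'],
               value_vars=stat_order)

# An ordered categorical sorts by the given category order
# instead of alphabetically.
long['variable'] = pd.Categorical(long['variable'],
                                  categories=stat_order, ordered=True)

wide = (long.pivot(index=['Periods since Start', 'variable'],
                   columns='Employee Number',
                   values='value')
            .sort_index()
            .reset_index()
            .rename_axis(None, axis=1))
print(wide)
```

The variable column stays categorical after reset_index(); converting it with astype(str) is an option if plain strings are preferred downstream.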