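Current dataframe: I have a pandas dataframe in which each employee has a list of text codes (every code starts with T), each followed by its frequency. All text codes are 8 characters long.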
+----------+-------------------------------------------------------------+
| emp_id | text |
+----------+-------------------------------------------------------------+
| E0001 | [T0431516,-8,T0401531,-12,T0517519,12] |
| E0002 | [T0701540,-1,T0431516,-2] |
| E0003 | [T0517519,-1,T0421531,-7,T0516319,9,T0500371,-6,T0309711,-3]|
| E0004 | [T0516319,-3] |
| E0005 | [T0431516,2] |
+----------+-------------------------------------------------------------+
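Expected dataframe: I am trying to turn the text codes into separate columns of the dataframe, holding the frequency if the employee has that code and 0 otherwise.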
+----------+----------------------------------------------------------------------------------------+
| emp_id | T0431516 | T0401531 | T0517519 | T0701540 | T0421531 | T0516319 | T0500371 | T0309711 |
+----------+----------------------------------------------------------------------------------------+
| E0001 | -8 | -12 | 12 | 0 | 0 | 0 | 0 | 0 |
| E0002 | -2 | 0 | 0 | -1 | 0 | 0 | 0 | 0 |
| E0003 | 0 | 0 | -1 | 0 | -7 | 9 | -6 | -3 |
| E0004 | 0 | 0 | 0 | 0 | 0 | -3 | 0 | 0 |
| E0005 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----------+----------------------------------------------------------------------------------------+
Sample data:
pd.DataFrame({'emp_id' : {0: 'E0001', 1: 'E0002', 2: 'E0003', 3: 'E0004', 4: 'E0005'},
'text' : {0: '[T0431516,-8,T0401531,-12,T0517519,12]', 1: '[T0701540,-1,T0431516,-2]', 2: '[T0517519,-1,T0421531,-7,T0516319,9,T0500371,-6,T0309711,-3]', 3: '[T0516319,-3]', 4: '[T0431516,2]'}
})
None of my attempts so far have worked. Any suggestions or help would be much appreciated!
You can explode the dataframe and then create a pivot_table:
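import pandas as pd

# Note: 'text' holds real Python lists here, not the bracketed strings from the question
df = pd.DataFrame({'emp_id' : ['E0001', 'E0002', 'E0003', 'E0004', 'E0005'],
'text' : [['T0431516',-8,'T0401531',-12,'T0517519',12],
['T0701540',-1,'T0431516',-2],['T0517519',-1,'T0421531',-7,'T0516319',9,'T0500371',-6,'T0309711',-3],
['T0516319',-3], ['T0431516',2]]})
# One row per list element, so codes and frequencies alternate down the column
df = df.explode('text')
# Each code is immediately followed by its frequency: bring the next row's value alongside it
df['freq'] = df['text'].shift(-1)
# Keep only the code rows (strings starting with 'T'); .copy() avoids SettingWithCopyWarning
df = df[df['text'].str[0] == 'T'].copy()
df['freq'] = df['freq'].astype(int)
# Codes become columns; codes an employee does not have are filled with 0
df = pd.pivot_table(df, index='emp_id', columns='text', values='freq', aggfunc='sum').fillna(0).astype(int)
df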
Out[1]:
text    T0309711  T0401531  T0421531  T0431516  T0500371  T0516319  T0517519  T0701540
emp_id
E0001          0       -12         0        -8         0         0        12         0
E0002          0         0         0        -2         0         0         0        -1
E0003         -3         0        -7         0        -6         9        -1         0
E0004          0         0         0         0         0        -3         0         0
E0005          0         0         0         2         0         0         0         0
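Note that the sample data in the question stores text as bracketed strings rather than Python lists, so explode would not split those values directly. Below is a minimal sketch of one way to handle the string form, assuming every code is T followed by seven digits and every frequency is an integer. The names pairs and wide are only illustrative, and the last step reorders the columns by first appearance to match the expected output, since pivot_table sorts them alphabetically.

```python
import pandas as pd

# Sample data as given in the question: 'text' is a string, not a list
df = pd.DataFrame({
    'emp_id': ['E0001', 'E0002', 'E0003', 'E0004', 'E0005'],
    'text': ['[T0431516,-8,T0401531,-12,T0517519,12]',
             '[T0701540,-1,T0431516,-2]',
             '[T0517519,-1,T0421531,-7,T0516319,9,T0500371,-6,T0309711,-3]',
             '[T0516319,-3]',
             '[T0431516,2]'],
})

# Assumed pattern: 'T' + 7 digits, a comma, then a signed integer.
# extractall returns one row per match, indexed by (original row, match number).
pairs = df['text'].str.extractall(r'(?P<code>T\d{7}),(?P<freq>-?\d+)')
pairs['freq'] = pairs['freq'].astype(int)

# Drop the match counter so the index is the original row label, then attach emp_id
pairs = pairs.droplevel('match').join(df['emp_id'])

# One column per code; codes an employee does not have become 0
wide = pairs.pivot_table(index='emp_id', columns='code', values='freq',
                         aggfunc='sum', fill_value=0).astype(int)

# Optional: order the columns by first appearance instead of alphabetically
wide = wide[pairs['code'].unique()]
print(wide)
```

If the real data contains codes or frequencies in another format, the regular expression would need adjusting; rows whose text matches nothing would simply be absent from the result.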