Reshape a dataframe with PySpark/Pandas according to custom logic



I have a dataframe structured like the one shown below:

  INP_A INP_B OUTP_A OUTP_B  LVL_NUM  BTCH_NUM
0    m1    b1                      0         1
1   m12   b12     m1     b1        1         1
2   m13   b13     m1     b1        1         1
3   m21   b21    m12    b12        2         1
4    x1    b1                      0         2
5   x12   b12     x1     b1        1         2
6   x13   b13    x12    b12        2         2
7   x21   b21    x13    b13        3         2
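For reference, the sample frame can be built roughly like this (a sketch; the blank OUTP cells are assumed to be empty strings so they print blank, but NaN would work the same way):

import pandas as pd

# sample data matching the table above; blank OUTP cells kept as empty strings
df = pd.DataFrame({
    'INP_A':    ['m1', 'm12', 'm13', 'm21', 'x1', 'x12', 'x13', 'x21'],
    'INP_B':    ['b1', 'b12', 'b13', 'b21', 'b1', 'b12', 'b13', 'b21'],
    'OUTP_A':   ['',   'm1',  'm1',  'm12', '',   'x1',  'x12', 'x13'],
    'OUTP_B':   ['',   'b1',  'b1',  'b12', '',   'b1',  'b12', 'b13'],
    'LVL_NUM':  [0, 1, 1, 2, 0, 1, 2, 3],
    'BTCH_NUM': [1, 1, 1, 1, 2, 2, 2, 2],
})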

You can use pivot after modifying the DataFrame: first number the repeated rows within each (BTCH_NUM, LVL_NUM) pair as idx, then renumber LVL_NUM consecutively within each (BTCH_NUM, idx) group, so that the pivot yields one row per (BTCH_NUM, idx) with one block of columns per level:

df2 = (df
       # number repeated rows within each (BTCH_NUM, LVL_NUM) pair
       .assign(idx=df.groupby(['BTCH_NUM', 'LVL_NUM']).cumcount(),
               # renumber the levels consecutively within each (BTCH_NUM, idx) group
               LVL_NUM=lambda d: d.groupby(['BTCH_NUM', 'idx']).cumcount().add(1))
       # one row per (BTCH_NUM, idx), one block of value columns per level
       .pivot(index=['BTCH_NUM', 'idx'], columns='LVL_NUM')
       # order the columns by level first rather than by original column name
       .sort_index(level=1, axis=1, sort_remaining=False)
       # flatten the resulting MultiIndex columns to names like 'INP_A_1'
       .pipe(lambda d: d.set_axis(d.columns.map(lambda x: f'{x[0]}_{x[1]}'), axis=1))
       )

Output:

              INP_A_1  INP_B_1 OUTP_A_1 OUTP_B_1  INP_A_2  INP_B_2 OUTP_A_2 OUTP_B_2  INP_A_3  INP_B_3 OUTP_A_3 OUTP_B_3  INP_A_4  INP_B_4 OUTP_A_4 OUTP_B_4
BTCH_NUM idx
1        0         m1       b1                        m12      b12       m1       b1      m21      b21      m12      b12      NaN      NaN      NaN      NaN
         1        m13      b13       m1       b1      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN
2        0         x1       b1                        x12      b12       x1       b1      x13      b13      x12      b12      x21      b21      x13      b13

Then get the individual sub-dataframes by slicing, e.g. for BTCH_NUM 1:

df2.loc[1]
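If you need one sub-frame per batch, you can do the same slice for every value of the first index level (a sketch; sub_frames is just an illustrative name):

# build one reshaped sub-frame per batch by slicing the BTCH_NUM index level
sub_frames = {
    b: df2.loc[b]
    for b in df2.index.get_level_values('BTCH_NUM').unique()
}

# optionally drop the columns that are entirely NaN for a given batch
sub_frames[1].dropna(axis=1, how='all')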

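The question title also mentions PySpark. A rough Spark equivalent of the same idea uses window functions for the two cumulative counts and groupBy().pivot() for the reshape. This is only a sketch under some assumptions: the Spark DataFrame is called sdf here, INP_A is used as an ordering key because Spark rows have no implicit order, and the pivoted columns come out named like 1_INP_A rather than INP_A_1.

from pyspark.sql import Window
from pyspark.sql import functions as F

# position of the row within its (BTCH_NUM, LVL_NUM) group (0-based, like cumcount)
w_idx = Window.partitionBy('BTCH_NUM', 'LVL_NUM').orderBy('INP_A')
# consecutive level number within each (BTCH_NUM, idx) group (1-based)
w_lvl = Window.partitionBy('BTCH_NUM', 'idx').orderBy('INP_A')

sdf2 = (sdf
        .withColumn('idx', F.row_number().over(w_idx) - 1)
        .withColumn('LVL_NUM', F.row_number().over(w_lvl))
        .groupBy('BTCH_NUM', 'idx')
        .pivot('LVL_NUM')
        .agg(F.first('INP_A').alias('INP_A'),
             F.first('INP_B').alias('INP_B'),
             F.first('OUTP_A').alias('OUTP_A'),
             F.first('OUTP_B').alias('OUTP_B')))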