I have a script that reads an .xlsx file and creates a DataFrame that looks like this:
index|TASK|CODE|NAME|WT|ST|ORIGIN|SRV|DESTINY|FT|MCLINE|ST.1|ORIGIN.1|SRV.1|DESTINY.1|FT.1|MCLINE.1
It can be longer, depending on the columns of the Excel file, but only the fields ST.(n), ORIGIN.(n), SRV.(n), DESTINY.(n), FT.(n) and MCLINE.(n) repeat.
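To make that layout concrete, here is a minimal sketch that generates the column names for a given number of repeated groups. The helper name `expected_columns` and the constants `BASE` and `REPEATED` are made up for illustration; the only assumption is the naming shown above, where the first group has no suffix and later groups get `.1`, `.2`, and so on.

```python
# Fixed columns that appear once per row
BASE = ['TASK', 'CODE', 'NAME', 'WT']
# Columns that repeat once per group
REPEATED = ['ST', 'ORIGIN', 'SRV', 'DESTINY', 'FT', 'MCLINE']


def expected_columns(groups):
    """Return the column names for `groups` repeated blocks (hypothetical helper)."""
    cols = list(BASE)
    for n in range(groups):
        # The first block is unsuffixed; later blocks get '.1', '.2', ...
        suffix = '' if n == 0 else f'.{n}'
        cols += [name + suffix for name in REPEATED]
    return cols


print(expected_columns(2))
```

With two groups this gives the same column list as above, minus the index.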
For example:
index | TASK | CODE | NAME | WT | ST | ORIGIN | SRV | DESTINY | FT | MCLINE | ST.1 | ORIGIN.1 | SRV.1 | DESTINY.1 | FT.1 | MCLINE.1 | ST.2 | ORIGIN.2 | SRV.2 | DESTINY.2 | FT.2 | MCLINE.2
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 61P00QH | 12900 | CROUCH,PETER | 06:14 | 14:46 | Pat Col | 61P004T | Pat Col | 16:06 | Etap 1-R*G0431 | | | | | | | | | | | |
1 | 61P00CH | 10900 | LAMPARD,FRANK | 07:13 | 06:20 | Pat Col | 61P00CT | Pat Col | 09:53 | Etap1-R*D0431 | 10:33 | Pat Col | 61P00CT | Pat Col | 14:13 | Etapa 1-R*D0431 | | | | | |
2 | 5SE00DH | 18049 | GERRARD,STEVEN | 07:30 | 11:55 | Grand Station | 5SE005O | Grand Station | 16:01 | Grandstation*D0290/Copa*D0291 | 16:41 | Grand Station | 5SE003O | Grand Station | 17:37 | No | 17:41 | Grand Station | 5SE009O | PatOda | 19:55 | GrandStation*D0290/Copa*D0291
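This is a classic case for pandas.wide_to_long. We need a small adjustment first, though, because the function expects all of the repeated columns to follow the same naming pattern, i.e. <COLNAME.N>, so the unsuffixed first group is renamed to ST.0, ORIGIN.0, and so on: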
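import pandas as pd

# import your data into a dataframe df
common_cols = ['ST', 'ORIGIN', 'SRV', 'DESTINY', 'FT', 'MCLINE']

# Give the first group an explicit '.0' suffix so every repeated column matches <COLNAME.N>
df = df.rename({col: col + '.0' for col in common_cols}, axis=1)

_df = (pd.wide_to_long(df, stubnames=common_cols,
                       i=['TASK', 'CODE', 'NAME', 'WT'],
                       j='n',
                       suffix=r'\.\d+')  # a literal dot followed by the group number
         .reset_index()
         .drop('n', axis=1)   # the group number is not needed in the output
         .dropna())           # drop the groups that were empty for a given task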
The result:
TASK CODE NAME WT ST ORIGIN SRV DESTINY FT MCLINE
0 61P00QH 12900 CROUCH,PETER 06:14 14:46 PatCol 61P004T PatCol 16:06 Etap1-R*G0431
3 61P00CH 10900 LAMPARD,FRANK 07:13 06:20 PatCol 61P00CT PatCol 09:53 Etap1-R*D0431
4 61P00CH 10900 LAMPARD,FRANK 07:13 10:33 PatCol 61P00CT PatCol 14:13 Etapa1-R*D0431
6 5SE00DH 18049 GERRARD,STEVEN 07:30 11:55 GrandStation 5SE005O GrandStation 16:01 Grandstation*D0290/Copa*D0291
7 5SE00DH 18049 GERRARD,STEVEN 07:30 16:41 GrandStation 5SE003O GrandStation 17:37 No
8 5SE00DH 18049 GERRARD,STEVEN 07:30 17:41 GrandStation 5SE009O PatOda 19:55 GrandStation*D0290/Copa*D0291
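For anyone who wants to try the approach without the original spreadsheet, here is a minimal self-contained sketch. The data is a made-up toy frame with only two of the repeated fields, and it is a variant rather than the exact code above: it passes `sep='.'` so the suffix pattern can be plain digits, and it keeps the `n` column to show which group each row came from.

```python
import pandas as pd

# Hypothetical toy data: two tasks, at most two repeated groups
df = pd.DataFrame({
    'TASK': ['A1', 'B2'],
    'CODE': [100, 200],
    'ST': ['06:00', '07:00'],
    'FT': ['08:00', '09:00'],
    'ST.1': [None, '10:00'],
    'FT.1': [None, '12:00'],
})

stubs = ['ST', 'FT']

# Rename the unsuffixed first group to ST.0 / FT.0 so it follows the same pattern
df = df.rename(columns={col: col + '.0' for col in stubs})

long_df = (pd.wide_to_long(df, stubnames=stubs,
                           i=['TASK', 'CODE'],
                           j='n',
                           sep='.',          # separator between stub name and group number
                           suffix=r'\d+')    # the group number itself
             .reset_index()
             .dropna()                       # remove groups that were empty for a task
             .reset_index(drop=True))

print(long_df)
```

Task A1 has only one filled group, so it contributes one row, while B2 contributes two. Dropping `n` afterwards, as in the answer above, gives the same shape as the result shown there.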