我有一个以下格式的csv文件:
| a | b | 2018 | 2018 | 2019 | 2019 |
| | | jan | feb | jan | feb |
---------------------------------------
| a1 | b1 | 0 | 1 | 2 | 3 |
| a1 | b2 | 4 | 5 | 6 | 7 |
| a2 | b1 | 8 | 9 | 10 | 11 |
| a2 | b2 | 12 | 13 | 14 | 15 |
我想把它读成熊猫DF,然后融化成以下格式:
| a | b | year | month | value |
----------------------------------
| a1 | b1 | 2018 | jan | 0 |
| a1 | b1 | 2018 | feb | 1 |
| a1 | b1 | 2019 | jan | 2 |
| a1 | b1 | 2019 | feb | 3 |
| a1 | b2 | 2018 | jan | 4 |
| a1 | b2 | 2018 | feb | 5 |
| a1 | b2 | 2019 | jan | 6 |
| a1 | b2 | 2019 | feb | 7 |
| a2 | b1 | 2018 | jan | 8 |
| a2 | b1 | 2018 | feb | 9 |
| a2 | b1 | 2019 | jan | 10 |
| a2 | b1 | 2019 | feb | 11 |
| a2 | b2 | 2018 | jan | 12 |
| a2 | b2 | 2018 | feb | 13 |
| a2 | b2 | 2019 | jan | 14 |
| a2 | b2 | 2019 | feb | 15 |
如何实现这一点?
对于纯数据帧,这应该有效:
import pandas as pd
df = pd.DataFrame({
'a': ['a1', 'a1', 'a2', 'a2',],
'b': ['b1', 'b2', 'b2', 'b2',],
'2018 jan': [0, 4, 8, 12],
'2018 feb': [1, 5, 9, 13],
'2019 jan': [2, 6, 10, 14],
'2019 feb': [3, 7, 11, 15],
})
df = df.melt(id_vars=['a', 'b'], var_name='date', value_name='value')
df['date'] = df['date'].str.split(' ')
df['year'] = df['date'].str[0]
df['month'] = df['date'].str[1]
df.drop(columns='date', inplace=True)
输出:
a b value year month
0 a1 b1 0 2018 jan
1 a1 b2 4 2018 jan
2 a2 b2 8 2018 jan
3 a2 b2 12 2018 jan
4 a1 b1 1 2018 feb
5 a1 b2 5 2018 feb
6 a2 b2 9 2018 feb
7 a2 b2 13 2018 feb
8 a1 b1 2 2019 jan
9 a1 b2 6 2019 jan
10 a2 b2 10 2019 jan
11 a2 b2 14 2019 jan
12 a1 b1 3 2019 feb
13 a1 b2 7 2019 feb
14 a2 b2 11 2019 feb
15 a2 b2 15 2019 feb
如果如注释中所述,列中有一些多索引,则可以在此处将其转换为纯数据帧:
df = pd.read_csv('file.csv', header=[0,1])
df.columns = [' '.join(col).strip() for col in df.columns.values]
df.rename(columns={'a Unnamed: 0_level_1': 'a', 'b Unnamed: 1_level_1': 'b'}, inplace=True)
@KOB我的答案通常可以适合任何带有 2 行标题的 CSV 文件,其中部分列仅在第一行,部分列位于第一行和第二行。根据您的问题,此代码将按照要求正确放置所有标头。 读取 CSV 和创建的多索引数据帧时:
df_multiidx = pd.read_csv('two_levels_header_file.csv', header=[0,1])
id_vars = [idv for idv in df_multiidx.columns if 'Unnamed' in idv[1]]
value_vars = [valv for valv in df_multiidx.columns if 'Unnamed' not in valv[1]]
df_multiidx= df_multiidx.melt(id_vars=id_vars, value_vars=value_vars,var_name=['year','month'])
df_multiidx.rename(columns={col_ren:col_ren[0] for col_ren in id_vars})
输出:
a b year month value
0 a1 b1 2018 jan 0
1 a1 b2 2018 jan 4
2 a2 b1 2018 jan 8
3 a2 b2 2018 jan 12
4 a1 b1 2018 feb 1
5 a1 b2 2018 feb 5
6 a2 b1 2018 feb 9
7 a2 b2 2018 feb 13
8 a1 b1 2019 jan 2
9 a1 b2 2019 jan 6
10 a2 b1 2019 jan 10
11 a2 b2 2019 jan 14
12 a1 b1 2019 feb 3
13 a1 b2 2019 feb 7
14 a2 b1 2019 feb 11
15 a2 b2 2019 feb 15