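I have a DataFrame that needs to be reshaped: for each year, the a, b, c, and d columns should be converted from rows to columns. df is the data before the conversion and df1 is the data after it. Which statements can be optimized? My code is as follows: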
import numpy as np
import pandas as pd
np.random.seed(2021)
df = pd.DataFrame()
years = np.arange(2020, 2016, -1)
df['year'] = years
df['a'] = 1
t1 = pd.DataFrame()
t1['code'] = np.arange(1, 1001)
t1['a'] = 1
df = pd.merge(df, t1, how='outer')
columns = ['a', 'b', 'c', 'd']
df[columns] = np.random.rand(len(df) * len(columns)).reshape(len(df), len(columns)) * 100
df = (df[['code', 'year'] + columns]
      .sort_values(by=['code', 'year'], ascending=[True, False])
      .reset_index(drop=True))
print('\nOriginal DataFrame df:')
print(df.head(10))
t1 = df.drop_duplicates('code')['code']
# For each year, suffix the value columns with that year and merge the block onto t1 by 'code'
for i in years:
    t2 = df[df.year == i].rename(columns={'a': 'a' + str(i), 'b': 'b' + str(i), 'c': 'c' + str(i), 'd': 'd' + str(i)})
    t2 = t2.drop(['year'], axis=1)
    t1 = pd.merge(t1, t2)
columns2 = ['code', 'a2020', 'a2019', 'a2018', 'a2017', 'b2020', 'b2019', 'b2018', 'b2017', 'c2020', 'c2019', 'c2018',
            'c2017', 'd2020', 'd2019', 'd2018', 'd2017']
df1 = t1[columns2].sort_values(by='code').reset_index(drop=True)
print('\nDesired result DataFrame df1:')
print(df1.head(10))
Original DataFrame df:
   code  year          a          b          c          d
0     1  2020  60.597828  73.336936  13.894716  31.267308
1     1  2019  13.789299  13.785302  42.329889  26.746788
2     1  2018  20.275643  17.800622  11.384761  48.751234
3     1  2017  84.178312  53.173128  55.714519  85.602826
4     2  2020  99.724328  12.816238  17.899311  75.292543
5     2  2019  84.299843  12.070816  84.099255  88.540943
6     2  2018  68.374138  54.314583  55.619371  78.855774
7     2  2017  92.724593  31.877149  17.871666   9.245337
8     3  2020  66.216051  78.431013   9.689440   5.857129
9     3  2019  30.026980  68.526491  76.161394  14.701888
Desired result DataFrame df1:
   code      a2020      a2019  ...      d2019      d2018      d2017
0     1  60.597828  13.789299  ...  26.746788  48.751234  85.602826
1     2  99.724328  84.299843  ...  88.540943  78.855774   9.245337
2     3  66.216051  30.026980  ...  14.701888  80.382850  93.540599
3     4  96.239599  19.826018  ...  11.576210  62.692749  60.357877
4     5  61.652471  63.574622  ...  89.279561  47.126554  23.663626
5     6  45.214524  23.944341  ...  67.501414  87.742089  93.640579
6     7  58.370402  49.246463  ...  40.886723  49.001126  14.725770
7     8  73.222503  37.440411  ...  76.945162  53.845775  96.251071
8     9  82.801437   4.264280  ...  57.847433  45.442707  70.354386
9    10  30.106862  91.145614  ...  36.387508  39.346212  85.077131
You can use pivot:
result = df.pivot(index="code", values=["a", "b", "c", "d"], columns="year")
# flatten and rename your columns if necessary:
result.columns = result.columns.to_flat_index().map(lambda x: "".join([x[0], str(x[1])]))
          a2017      a2018      a2019      a2020  ...      d2017      d2018      d2019      d2020
code                                              ...
1     54.280967  52.179967  50.494392   8.822997  ...  10.077397  47.295947  81.042424  81.797062
2     88.683948  28.406555  92.679674  62.223895  ...  29.423042  56.582891  20.835779  46.707585
3           NaN        NaN   9.467721  34.100070  ...        NaN        NaN  35.135736   4.749653
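
The pivot result above orders the years ascending and keeps code as the index, whereas the df1 in the question has the years descending and code as a regular column. The following is a minimal sketch of one way to get the same layout as df1. It assumes a small stand-in frame (3 codes instead of 1000, built with a different random generator than the question's), and the names rng, value_cols, and wide are illustrative only.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's df: 3 codes x 4 years (assumed sizes).
rng = np.random.default_rng(2021)
years = [2020, 2019, 2018, 2017]
value_cols = ["a", "b", "c", "d"]
index = pd.MultiIndex.from_product([range(1, 4), years], names=["code", "year"])
df = pd.DataFrame(rng.random((len(index), len(value_cols))) * 100,
                  index=index, columns=value_cols).reset_index()

# Wide format: one row per code, one column per (value column, year) pair.
wide = df.pivot(index="code", columns="year", values=value_cols)

# Reorder the columns to a2020, a2019, ..., d2017, as in df1.
wide = wide.reindex(columns=pd.MultiIndex.from_product([value_cols, years]))

# Flatten the two-level column index into names such as "a2020".
wide.columns = [f"{col}{year}" for col, year in wide.columns]

# Turn the "code" index back into an ordinary column.
df1 = wide.reset_index()
print(df1.head())
```

This replaces the per-year rename and merge loop with a single reshape, so the number of merges no longer grows with the number of years.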