我想知道如何" aggregate ";这个转置的数据集。我不确定聚合是否是正确的词,因为数学是跨行发生的。我有一个类似于这样的数据框架:
编辑:"date."中有多个相同的值。数据被转换为人员ID。也有日期1-5的专栏。下表中引用的日期是我最终对创建的nrev1 - nrev#值进行聚合的日期。
我们可以做一些重塑与pd.wide_to_long
然后pivot_table
回到宽幅格式。这允许我们对齐Return和Rev行,然后将Return值转换为新列。可以用add_prefix
和rename_axis
进行一些清理以抛光输出:
new_df = (
pd.wide_to_long(df, stubnames=['Return', 'Rev'], i='Date', j='K')
.dropna()
.astype({'Return': int})
.pivot_table(index='Date', columns='Return', values='Rev', fill_value=0)
.add_prefix('NRev')
.rename_axis(columns=None)
.reset_index()
)
new_df
:
Date NRev0 NRev1 NRev2 NRev3 NRev4 NRev5 NRev6 NRev7 NRev8 NRev9
0 2020-1 100 500 100 200 300 0 0 0 0 0
1 2020-2 0 0 0 0 0 200 120 100 200 0
2 2020-3 0 0 100 0 0 0 0 100 0 200
wide_to_long
给出:
Return Rev
Date K
2020-1 1 0.0 100.0 # Corresponding Return index and Rev are in the same row
2020-2 1 5.0 200.0
2020-3 1 2.0 100.0
2020-1 2 1.0 500.0
2020-2 2 6.0 120.0
2020-3 2 3.0 0.0
2020-1 3 2.0 100.0
2020-2 3 7.0 100.0
2020-3 3 7.0 100.0
2020-1 4 3.0 200.0
2020-2 4 8.0 200.0
2020-3 4 9.0 200.0
2020-1 5 4.0 300.0
2020-2 5 NaN NaN
2020-3 5 NaN NaN # These NaN are Not Needed
移除NaN步骤并返回Return to int
(pd.wide_to_long(df, stubnames=['Return', 'Rev'], i='Date', j='K')
.dropna()
.astype({'Return': int}))
Return Rev
Date K
2020-1 1 0 100.0
2020-2 1 5 200.0
2020-3 1 2 100.0
2020-1 2 1 500.0
2020-2 2 6 120.0
2020-3 2 3 0.0
2020-1 3 2 100.0
2020-2 3 7 100.0
2020-3 3 7 100.0
2020-1 4 3 200.0
2020-2 4 8 200.0
2020-3 4 9 200.0
2020-1 5 4 300.0
然后可以很容易地用pivot_table
:
(pd.wide_to_long(df, stubnames=['Return', 'Rev'], i='Date', j='K')
.dropna()
.astype({'Return': int})
.pivot_table(index='Date', columns='Return', values='Rev', fill_value=0))
Return 0 1 2 3 4 5 6 7 8 9
Date
2020-1 100 500 100 200 300 0 0 0 0 0
2020-2 0 0 0 0 0 200 120 100 200 0
2020-3 0 0 100 0 0 0 0 100 0 200
其余的只是对DataFrame的修饰性更改。
如果日期重复,则不能使用wide_to_long
,但我们可以手动重塑DataFrame以使用str.extract
然后set_index
+stack
:
# Set Index Column
new_df = df.set_index('Date')
# Handle MultiIndex Manually
new_df.columns = pd.MultiIndex.from_frame(
new_df.columns.str.extract('(.*)(d+)$')
)
# Stack then the rest is the same
new_df = (
new_df.stack()
.dropna()
.astype({'Return': int})
.pivot_table(index='Date', columns='Return', values='Rev',
fill_value=0, aggfunc='first')
.add_prefix('NRev')
.rename_axis(columns=None)
.reset_index()
)
日期重复的DF样本:
df = pd.DataFrame({'Date': ['2020-1', '2020-2', '2020-2'],
'Return1': [0, 5, 0],
'Return2': [1, 6, 1],
'Return3': [2, 7, 2],
'Return4': [3, 8, 3],
'Return5': [4.0, nan, 4.0],
'Rev1': [100, 200, 100],
'Rev2': [500, 120, 0],
'Rev3': [100, 100, 100],
'Rev4': [200, 200, 200],
'Rev5': [300.0, nan, nan]})
df
Date Return1 Return2 Return3 Return4 Return5 Rev1 Rev2 Rev3 Rev4 Rev5
0 2020-1 0 1 2 3 4.0 100 500 100 200 300.0
1 2020-2 5 6 7 8 NaN 200 120 100 200 NaN
2 2020-2 0 1 2 3 4.0 100 0 100 200 NaN
new_df
Date NRev0 NRev1 NRev2 NRev3 NRev4 NRev5 NRev6 NRev7 NRev8
0 2020-1 100 500 100 200 300 0 0 0 0
1 2020-2 100 0 100 200 0 200 120 100 200