如果在转置数据集中满足条件,如何聚合?



我想知道如何" aggregate ";这个转置的数据集。我不确定聚合是否是正确的词,因为数学是跨行发生的。我有一个类似于这样的数据框架:

编辑:"date."中有多个相同的值。数据被转换为人员ID。也有日期1-5的专栏。下表中引用的日期是我最终对创建的nrev1 - nrev#值进行聚合的日期。

<表类> 日期 Return1 Return2 Return3 Return4 Return5 Rev1 Rev2 Rev3 Rev4 Rev5 tbody><<tr>2020 - 1012341005001002003002020 - 25678南200120100200南2020 - 32379南1000100200南

我们可以做一些重塑与pd.wide_to_long然后pivot_table回到宽幅格式。这允许我们对齐Return和Rev行,然后将Return值转换为新列。可以用add_prefixrename_axis进行一些清理以抛光输出:

new_df = (
pd.wide_to_long(df, stubnames=['Return', 'Rev'], i='Date', j='K')
.dropna()
.astype({'Return': int})
.pivot_table(index='Date', columns='Return', values='Rev', fill_value=0)
.add_prefix('NRev')
.rename_axis(columns=None)
.reset_index()
)

new_df:

Date  NRev0  NRev1  NRev2  NRev3  NRev4  NRev5  NRev6  NRev7  NRev8  NRev9
0  2020-1    100    500    100    200    300      0      0      0      0      0
1  2020-2      0      0      0      0      0    200    120    100    200      0
2  2020-3      0      0    100      0      0      0      0    100      0    200

wide_to_long给出:

Return    Rev
Date   K               
2020-1 1     0.0  100.0  # Corresponding Return index and Rev are in the same row
2020-2 1     5.0  200.0
2020-3 1     2.0  100.0
2020-1 2     1.0  500.0
2020-2 2     6.0  120.0
2020-3 2     3.0    0.0
2020-1 3     2.0  100.0
2020-2 3     7.0  100.0
2020-3 3     7.0  100.0
2020-1 4     3.0  200.0
2020-2 4     8.0  200.0
2020-3 4     9.0  200.0
2020-1 5     4.0  300.0
2020-2 5     NaN    NaN
2020-3 5     NaN    NaN  # These NaN are Not Needed

移除NaN步骤并返回Return to int

(pd.wide_to_long(df, stubnames=['Return', 'Rev'], i='Date', j='K')
.dropna()
.astype({'Return': int}))
Return    Rev
Date   K               
2020-1 1       0  100.0
2020-2 1       5  200.0
2020-3 1       2  100.0
2020-1 2       1  500.0
2020-2 2       6  120.0
2020-3 2       3    0.0
2020-1 3       2  100.0
2020-2 3       7  100.0
2020-3 3       7  100.0
2020-1 4       3  200.0
2020-2 4       8  200.0
2020-3 4       9  200.0
2020-1 5       4  300.0

然后可以很容易地用pivot_table:

将其移回wide
(pd.wide_to_long(df, stubnames=['Return', 'Rev'], i='Date', j='K')
.dropna()
.astype({'Return': int})
.pivot_table(index='Date', columns='Return', values='Rev', fill_value=0))
Return    0    1    2    3    4    5    6    7    8    9
Date                                                    
2020-1  100  500  100  200  300    0    0    0    0    0
2020-2    0    0    0    0    0  200  120  100  200    0
2020-3    0    0  100    0    0    0    0  100    0  200

其余的只是对DataFrame的修饰性更改。


如果日期重复,则不能使用wide_to_long,但我们可以手动重塑DataFrame以使用str.extract然后set_index+stack:

# Set Index Column
new_df = df.set_index('Date')
# Handle MultiIndex Manually
new_df.columns = pd.MultiIndex.from_frame(
new_df.columns.str.extract('(.*)(d+)$')
)
# Stack then the rest is the same
new_df = (
new_df.stack()
.dropna()
.astype({'Return': int})
.pivot_table(index='Date', columns='Return', values='Rev',
fill_value=0, aggfunc='first')
.add_prefix('NRev')
.rename_axis(columns=None)
.reset_index()
)

日期重复的DF样本:

df = pd.DataFrame({'Date': ['2020-1', '2020-2', '2020-2'],
'Return1': [0, 5, 0],
'Return2': [1, 6, 1],
'Return3': [2, 7, 2],
'Return4': [3, 8, 3],
'Return5': [4.0, nan, 4.0],
'Rev1': [100, 200, 100],
'Rev2': [500, 120, 0],
'Rev3': [100, 100, 100],
'Rev4': [200, 200, 200],
'Rev5': [300.0, nan, nan]})

df

Date  Return1  Return2  Return3  Return4  Return5  Rev1  Rev2  Rev3  Rev4   Rev5
0  2020-1        0        1        2        3      4.0   100   500   100   200  300.0
1  2020-2        5        6        7        8      NaN   200   120   100   200    NaN
2  2020-2        0        1        2        3      4.0   100     0   100   200    NaN

new_df

Date  NRev0  NRev1  NRev2  NRev3  NRev4  NRev5  NRev6  NRev7  NRev8
0  2020-1    100    500    100    200    300      0      0      0      0
1  2020-2    100      0    100    200      0    200    120    100    200

最新更新