Multiple functions with pandas transform



I have a dataset that looks like this:

entity_id transaction_date transaction_month  net_flow    inflow   outflow
0         51       2018-07-02        2018-07-01  10161.06  20161.06  10000.00
1         51       2018-07-03        2018-07-01   5823.73   5867.37     43.64
2         51       2018-07-05        2018-07-01  17835.79  24107.29   6271.50
3         51       2018-07-06        2018-07-01  -3544.72  31782.84  35327.56
4         51       2018-07-09        2018-07-01  18252.42  18332.42     80.00

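I'm trying to use transform to compute rolling metrics across the entity_id field. I have several variables I want to create, and I'd like to compute them all in a single call.

For example, if I were creating these metrics with agg, I would do something like this: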

transactions = (
    raw_transactions
    .groupby(['entity_id', 'transaction_month'])[['inflow', 'outflow']]
    .agg([
        'sum', 'skew',
        ('coef_var', lambda x: x.std() / x.mean()),
        ('kurtosis', lambda x: x.kurtosis()),
    ])
    .reset_index()
)

However, I can't replicate this with transform. When I try to pass the functions as a dict or a list, I get a TypeError because lists and dicts are unhashable.

>>> transactions.groupby(['entity_id'])[['inflow','outflow']].transform(['skew','mean'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-62-4ef49d836b3f> in <module>
----> 1 transactions.groupby(['entity_id'])[['inflow','outflow']].transform(['skew','mean'])
/jupyter/packages/pandas/core/groupby/generic.py in transform(self, func, engine, engine_kwargs, *args, **kwargs)
1354 
1355         # optimized transforms
-> 1356         func = self._get_cython_func(func) or func
1357 
1358         if not isinstance(func, str):
/jupyter/packages/pandas/core/base.py in _get_cython_func(self, arg)
335         if we define an internal function for this argument, return it
336         """
--> 337         return self._cython_table.get(arg)
338 
339     def _is_builtin_func(self, arg):
TypeError: unhashable type: 'list'
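
For context, the last frame of the traceback shows the argument being looked up with a dict-style `.get(arg)`. The sketch below reproduces the same kind of failure with a plain dict. The `table` dict is a hypothetical stand-in, not the actual pandas lookup table.

```python
# Hypothetical stand-in for an internal name -> function lookup table.
table = {"sum": "sum", "mean": "mean"}

try:
    # A list cannot be hashed, so it cannot be used as a dict key,
    # even for a lookup that would otherwise just return None.
    table.get(["skew", "mean"])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

A single string such as 'skew' is hashable, which is why passing one function name at a time to transform does not hit this error.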

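I don't think this is possible with transform. You have at least two workarounds. The first is to merge the result of groupby.agg back onto the original dataframe: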

tmp_ = (
    raw_transactions
    .groupby(['entity_id', 'transaction_month'])[['inflow', 'outflow']]
    .agg([
        'sum', 'skew',
        ('coef_var', lambda x: x.std() / x.mean()),
        ('kurtosis', lambda x: x.kurtosis()),
    ])  # no reset_index here
)
# need to flatten multiindex columns
tmp_.columns = ['_'.join(cols) for cols in tmp_.columns]
# then merge with original dataframe
res = raw_transactions.merge(tmp_, on=['entity_id', 'transaction_month'])
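
To see how the merge approach broadcasts group-level statistics back onto every row (which is what transform would do), here is a minimal runnable sketch. The toy values in `raw_transactions` and the names `keys` and `stats` are made up for illustration, and only 'sum' and 'mean' are used to keep the numbers easy to check by hand.

```python
import pandas as pd

# Toy data, assumed for illustration only.
raw_transactions = pd.DataFrame({
    "entity_id": [51, 51, 51, 52, 52],
    "transaction_month": ["2018-07-01", "2018-07-01", "2018-08-01",
                          "2018-07-01", "2018-07-01"],
    "inflow": [10.0, 30.0, 5.0, 1.0, 3.0],
    "outflow": [4.0, 6.0, 1.0, 2.0, 2.0],
})

keys = ["entity_id", "transaction_month"]

# One row per group, with two-level columns such as ('inflow', 'sum').
stats = raw_transactions.groupby(keys)[["inflow", "outflow"]].agg(["sum", "mean"])

# Flatten ('inflow', 'sum') -> 'inflow_sum', then expose the keys as columns.
stats.columns = ["_".join(cols) for cols in stats.columns]
stats = stats.reset_index()

# A left merge keeps every original row and repeats each group's statistics.
res = raw_transactions.merge(stats, on=keys, how="left")
print(res)
```

Every row of the original frame is kept, and each row carries the statistics of the group it belongs to.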

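Or use a list comprehension over the different functions to call transform once per function, inside a concat with the original data: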

import pandas as pd

# group once
gr = raw_transactions.groupby(['entity_id'])[['inflow', 'outflow']]
# concat each transformed dataframe with the original data
res = pd.concat(
    [raw_transactions]
    + [gr.transform(func)
       for func in ('skew', 'mean', lambda x: x.std() / x.mean())],
    axis=1, keys=('', 'skew', 'mean', 'coef_var'))

Then you can work on the column names.
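
The concat call with keys produces two-level column labels such as ('mean', 'inflow'), with an empty first level for the original columns. One possible way to flatten them is sketched below. The toy data, the `funcs` dict, and the `<column>_<function>` naming scheme are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data, assumed for illustration only.
raw_transactions = pd.DataFrame({
    "entity_id": [51, 51, 51, 52, 52, 52],
    "inflow": [10.0, 20.0, 60.0, 1.0, 2.0, 3.0],
    "outflow": [5.0, 5.0, 20.0, 2.0, 4.0, 6.0],
})

# Group once and reuse the groupby object for every function.
gr = raw_transactions.groupby("entity_id")[["inflow", "outflow"]]

# Label -> function; the labels become the first level of the column index.
funcs = {
    "mean": "mean",
    "skew": "skew",
    "coef_var": lambda x: x.std() / x.mean(),
}

res = pd.concat(
    [raw_transactions] + [gr.transform(f) for f in funcs.values()],
    axis=1,
    keys=["", *funcs],  # '' marks the untouched original columns
)

# Flatten ('', 'inflow') -> 'inflow' and ('mean', 'inflow') -> 'inflow_mean'.
res.columns = [col if not key else f"{col}_{key}" for key, col in res.columns]
print(res.columns.tolist())
```

Keeping the labels and functions together in one dict avoids having to keep a separate keys tuple in sync with the list of functions.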
