基于GroupBy的Pandas数据框上的自定义移动平均线



有一个列为IDFeature_1的数据集。CCD_ 3可以理解为以秒为单位的会话的特定持续时间。还有一个自定义函数,它根据窗口宽度导致的NaN数量,在开始时添加简单平均值来计算移动平均值。这是:

def moving_average_mit_anfang(x, w):
# First part - simple average
first_part_result = np.cumsum(x)/np.cumsum(np.ones(len(x)))
# If appearence of user's sessions is greater than window width, we calculate moving average
if len(x)>w:
# Second part - moving average with window w
sec_part_result = np.convolve(x, np.ones(w), 'valid') / w
return np.append(first_part_result[:-len(sec_part_result)],sec_part_result)
# Otherwise we calculate only simple average
else:
return first_part_result

我们应该在Featrue_1列上应用这个函数,根据相应ID的出现时间,得到每个ID的当前平均值。

示例数据帧:

pd.DataFrame(data={'ID':[1,2,3,2,3,1,2,1,3,3,3,2,1],
'Feature_1':[4,5,6,73,2,21,13,45,32,9,18,45,39]})

我试过这个:

test_df.groupby('ID')['Feature_1'].transform(lambda x: moving_average_mit_anfang(x,1))

得到了这个:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-38-6cc3e6c9b134> in <module>
----> 1 test_df.groupby('ID')['Feature_1'].transform(lambda x: moving_average_mit_anfang(x,1))
~/DS/RS/rs_env/lib/python3.8/site-packages/pandas/core/groupby/generic.py in transform(self, func, engine, engine_kwargs, *args, **kwargs)
505 
506         if not isinstance(func, str):
--> 507             return self._transform_general(func, *args, **kwargs)
508 
509         elif func not in base.transform_kernel_allowlist:
~/DS/RS/rs_env/lib/python3.8/site-packages/pandas/core/groupby/generic.py in _transform_general(self, func, *args, **kwargs)
535                 res = res._values
536 
--> 537             results.append(klass(res, index=group.index))
538 
539         # check for empty "results" to avoid concat ValueError
~/DS/RS/rs_env/lib/python3.8/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
346                 try:
347                     if len(index) != len(data):
--> 348                         raise ValueError(
349                             f"Length of passed values is {len(data)}, "
350                             f"index implies {len(index)}."
ValueError: Length of passed values is 6, index implies 4.

输出应该像:

ID  Feature_1  Custom average
0    1          4             4.0
1    2          5             5.0
2    3          6             6.0
3    2         73            39.0
4    3          2             4.0
5    1         21            12.5
6    2         13            43.0
7    1         45            33.0
8    3         32             4.0
9    3          9            20.5
10   3         18            13.5
11   2         45            29.0
12   1         39            42.0

您的新解决方案正在运行,也可以为更简单的解决方案省略lambda函数(lambda也在运行(:

test_df['Custom average'] = test_df.groupby('ID')['Feature_1'].transform(moving_average_mit_anfang,2)
print (test_df)
ID  Feature_1  Custom average
0    1          4             4.0
1    2          5             5.0
2    3          6             6.0
3    2         73            39.0
4    3          2             4.0
5    1         21            12.5
6    2         13            43.0
7    1         45            33.0
8    3         32            17.0
9    3          9            20.5
10   3         18            13.5
11   2         45            29.0
12   1         39            42.0

最新更新