移动平均值的窗口函数



我正在尝试在panda中复制SQL的窗口函数。

SELECT avg(totalprice) OVER (
PARTITION BY custkey
ORDER BY orderdate
RANGE BETWEEN interval '1' month PRECEDING AND CURRENT ROW)
FROM orders

我有这个数据帧:

from io  import StringIO
import pandas as pd
myst="""cust_1,2020-10-10,100
cust_2,2020-10-10,15
cust_1,2020-10-15,200
cust_1,2020-10-16,240
cust_2,2020-12-20,25
cust_1,2020-12-25,140
cust_2,2021-01-01,5
"""
u_cols=['customer_id', 'date', 'price']
myf = StringIO(myst)
import pandas as pd
df = pd.read_csv(StringIO(myst), sep=',', names = u_cols)
df=df.sort_values(list(df.columns))

在计算了限制在过去1个月的移动平均线后,它将是这样的。。。

from io  import StringIO
import pandas as pd
myst="""cust_1,2020-10-10,100,100
cust_2,2020-10-10,15,15
cust_1,2020-10-15,200,150
cust_1,2020-10-16,240,180
cust_2,2020-12-20,25,25
cust_1,2020-12-25,140,140
cust_2,2021-01-01,5,15
"""
u_cols=['customer_id', 'date', 'price', 'my_average']
myf = StringIO(myst)
import pandas as pd
my_df = pd.read_csv(StringIO(myst), sep=',', names = u_cols)
my_df=my_df.sort_values(list(my_df.columns))

如图所示:

https://trino.io/assets/blog/window-features/running-average-range.svg

我试着写这样一个函数。。。

import numpy as np
def mylogic(myro):
mylist = list()
mydate = myro['date'][0]
for i in range(len(myro)):            
if myro['date'][i] > mydate:
mylist.append(myro['price'][i])
mydate = myro['date'][i]
return np.mean(mylist)

但这返回了一个key_error。

您可以在最后30天的上使用滚动功能

df['date'] = pd.to_datetime(df['date'])    
df['my_average'] = (df.groupby('customer_id')
.apply(lambda d: d.rolling('30D', on='date')['price'].mean())
.reset_index(level=0, drop=True)
.astype(int)
)

输出:

customer_id       date  price  my_average
0      cust_1 2020-10-10    100         100
2      cust_1 2020-10-15    200         150
3      cust_1 2020-10-16    240         180
5      cust_1 2020-12-25    140         140
1      cust_2 2020-10-10     15          15
4      cust_2 2020-12-20     25          25
6      cust_2 2021-01-01      5          15

最新更新