Calculating lag features: mean of a pandas column over the previous 3 months, 6 months, 1 year and 2 years



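Hello, I have a pandas DataFrame with the columns month (int), year (int), product_id (str), city_id (str), customer_id (str) and sales (int). I want to first group by each customer_id / product_id / city_id / month / year combination, then compute, for every combination, the mean of the sales column over the previous 3 months, 6 months, 1 year and 2 years, and add those means as new columns in the DataFrame.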

I would appreciate any feedback. Thanks.

Prepare some sample data:

import pandas as pd
import numpy as np
# sample data: monthly timestamps from 2017-01 to 2020-12
dti = pd.date_range("2017-1-1", "2020-12-1", freq="MS")
products = ["products{}".format(c) for c in range(1, 4)]
cities = ["cities{}".format(c) for c in range(1, 3)]
customers = ["customer{}".format(c) for c in range(1, 5)]
# Note: dti.month and dti.year each hold 48 entries, so from_product yields
# 48 rows for every (month, year, product, city, customer) combination.
idx = pd.MultiIndex.from_product([dti.month, dti.year, products, cities, customers],
                                 names=["month", "year", "product_id", "city_id", "customer_id"])
df = pd.DataFrame({"sales": np.random.randint(0, 30, size=len(idx))}, index=idx).reset_index()
>>> df
       month  year product_id  city_id customer_id  sales
0          1  2017  products1  cities1   customer1     20
1          1  2017  products1  cities1   customer2     28
2          1  2017  products1  cities1   customer3      5
3          1  2017  products1  cities1   customer4     17
4          1  2017  products1  cities2   customer1     16
...      ...   ...        ...      ...         ...    ...
55291     12  2020  products3  cities1   customer4      6
55292     12  2020  products3  cities2   customer1      1
55293     12  2020  products3  cities2   customer2     21
55294     12  2020  products3  cities2   customer3     16
55295     12  2020  products3  cities2   customer4      3

Convert (year, month) to a datetime; date operations are easier afterwards:

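# Build a month-start timestamp from the year and month columns.
df2 = df.assign(datetime=pd.to_datetime(df["year"].astype(str) + df["month"].astype(str), format="%Y%m"))
df2 = df2.drop(columns=["month", "year"]).set_index("datetime").sort_index()
cols = ["product_id", "city_id", "customer_id"]
out = []
for month in [3, 6, 12, 24]:
    # The slice includes `start` itself, so an offset of N months spans N + 1 monthly timestamps.
    start = df2.index.max() - pd.tseries.offsets.DateOffset(months=month)
    # "sales" is the only remaining value column, so squeeze() turns the result into a Series.
    out.append(df2[start:].groupby(cols).mean().squeeze().rename("{}M".format(month)))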

Output:

>>> pd.concat(out, axis="columns")
                                       3M         6M        12M        24M
product_id city_id customer_id
products1  cities1 customer1    15.161458  14.779762  14.788462  14.433333
                   customer2    14.479167  14.580357  14.706731  14.761667
                   customer3    13.666667  13.738095  14.229167  14.680000
                   customer4    14.828125  14.875000  14.641026  14.539167
           cities2 customer1    14.723958  14.211310  14.161859  14.440000
                   customer2    14.791667  15.083333  14.907051  14.781667
                   customer3    15.015625  14.273810  14.605769  14.664167
                   customer4    14.036458  14.505952  14.358974  14.267500
products2  cities1 customer1    13.666667  13.755952  14.166667  14.179167
                   customer2    13.484375  14.375000  14.025641  14.277500
                   customer3    13.645833  14.119048  14.307692  14.296667
                   customer4    13.244792  13.657738  14.112179  14.137500
           cities2 customer1    13.682292  14.541667  14.434295  14.366667
                   customer2    13.979167  14.116071  14.304487  14.592500
                   customer3    16.489583  15.744048  14.828526  14.671667
                   customer4    14.625000  14.505952  14.099359  14.409167
products3  cities1 customer1    13.562500  14.125000  14.149038  14.221667
                   customer2    14.713542  14.717262  14.520833  14.060833
                   customer3    14.562500  14.696429  14.560897  14.596667
                   customer4    14.536458  14.651786  14.314103  14.075833
           cities2 customer1    14.130208  14.559524  14.089744  14.365000
                   customer2    14.276042  14.279762  14.352564  14.273333
                   customer3    13.901042  14.657738  14.641026  14.254167
                   customer4    14.411458  14.303571  14.258013  14.265000
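
The table above holds one trailing mean per combination, measured back from the latest date in the data. If a per-row lag feature is wanted instead (for each month, the mean of the months before it), a rolling mean shifted by one row is one option. This is a different technique from the loop above, shown only as a minimal sketch. It assumes exactly one row per combination per month with no missing months; the small dataset, the helper `add_lag_means` and the column names `sales_mean_prev_{N}M` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per (product, city, customer, month), no gaps.
keys = ["product_id", "city_id", "customer_id"]
dti = pd.date_range("2019-01-01", "2020-12-01", freq="MS")
idx = pd.MultiIndex.from_product(
    [["products1", "products2"], ["cities1"], ["customer1", "customer2"], dti],
    names=keys + ["datetime"],
)
rng = np.random.default_rng(0)
monthly = pd.DataFrame({"sales": rng.integers(0, 30, size=len(idx))}, index=idx).reset_index()
monthly = monthly.sort_values(keys + ["datetime"]).reset_index(drop=True)


def add_lag_means(frame, windows=(3, 6, 12, 24)):
    """Add, per group, the mean of the previous n rows (months) of sales."""
    result = frame.copy()
    grouped = frame.groupby(keys)["sales"]
    for n in windows:
        # shift(1) drops the current month from the window, so only earlier
        # months contribute; min_periods=1 keeps partial windows early on.
        result["sales_mean_prev_{}M".format(n)] = grouped.transform(
            lambda s, n=n: s.shift(1).rolling(n, min_periods=1).mean()
        )
    return result


lagged = add_lag_means(monthly)
```

The first month of every group has no history, so its lag columns are NaN. Because the window counts rows rather than calendar time, months missing from the data would need to be filled in first (or a time-based window used) for the columns to mean "previous N months".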

I am getting an error because I also have a week variable, so my DataFrame is a little different:

month  year product_id  city_id customer_id  sales
    1  2017  products1  cities1   customer1      1
    1  2017  products1  cities1   customer1      2
    1  2017  products1  cities1   customer1      3
    1  2017  products1  cities2   customer2      4
    1  2017  products1  cities3   customer3      5

The error I get is TypeError: Index(...) must be called with a collection of some kind, '3M' was passed. Any feedback would be greatly appreciated.
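
One likely cause: with an extra numeric column such as week, groupby(cols).mean() returns a DataFrame with more than one column, squeeze() leaves it as a DataFrame, and DataFrame.rename then treats the string "3M" as an index mapper, which raises this TypeError. Selecting the sales column before taking the mean avoids that. A minimal sketch, assuming the extra column is called `week` and the frame is already indexed by `datetime` as in the answer; the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with an extra numeric "week" column next to "sales".
cols = ["product_id", "city_id", "customer_id"]
dti = pd.date_range("2018-01-01", "2020-12-01", freq="MS")
idx = pd.MultiIndex.from_product(
    [dti, ["products1", "products2"], ["cities1", "cities2"], ["customer1"]],
    names=["datetime"] + cols,
)
rng = np.random.default_rng(0)
df2 = pd.DataFrame(
    {"week": rng.integers(1, 5, size=len(idx)), "sales": rng.integers(0, 30, size=len(idx))},
    index=idx,
).reset_index(cols).sort_index()

out = []
for month in [3, 6, 12, 24]:
    start = df2.index.max() - pd.DateOffset(months=month)
    # Selecting "sales" first always yields a Series, so rename(str) sets its name
    # regardless of how many other columns the frame has.
    out.append(df2.loc[start:].groupby(cols)["sales"].mean().rename("{}M".format(month)))

result = pd.concat(out, axis="columns")
```

With the column selected explicitly, squeeze() is no longer needed, and any other columns in the frame are simply ignored.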
