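I have a DataFrame with the following columns: product, product_launch_date, date, revenue. How can I compute each product's total revenue in 30-day intervals since launch?
The product_launch_date is repeated on every row for a given product.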
df['days_since_launch'] = (df['date'] - df['product_launch_date']).dt.days
I want to bucket 'days_since_launch' into 30-day intervals for each product. A naive way is to build the conditions by hand:
con1 = df['days_since_launch'] <= 30
con2 = (df['days_since_launch'] > 30) & (df['days_since_launch'] <= 60)
...
df['date_bucket'] = 0
df.loc[con1, 'date_bucket'] = 1
df.loc[con2, 'date_bucket'] = 2
...
Finally, I can aggregate by date_bucket with groupby:
df.groupby(['product', 'date_bucket'])['revenue'].agg('sum').reset_index()
The question is how to generate date_bucket in a smarter way. Thanks.
It looks like you can use pd.cut:
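import numpy as np
import pandas as pd
df['days_since_launch'] = (df['date'] - df['product_launch_date']).dt.days
# Stop at max + 31 so the last edge covers the maximum; include_lowest=True keeps the minimum value in the first bin
bins = np.arange(df['days_since_launch'].min(), df['days_since_launch'].max() + 31, 30)
df['date_bucket'] = pd.cut(df['days_since_launch'], bins=bins, labels=False, include_lowest=True)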
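As a rough end-to-end sketch of the pd.cut idea, here is a small example on made-up data. A few things in it are assumptions rather than part of the original answer: the sample rows are hypothetical, the bin edges start at 0 (so buckets line up with each product's launch date instead of the smallest observed value), 1 is added to the labels so they match the 1, 2, ... numbering used in the question, and the bucket_arith column is only there to cross-check the result with plain arithmetic.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "product": ["A", "A", "A", "B", "B"],
    "product_launch_date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-01", "2021-03-01", "2021-03-01"]
    ),
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-31", "2021-02-15", "2021-03-11", "2021-05-10"]
    ),
    "revenue": [100.0, 50.0, 25.0, 10.0, 5.0],
})

df["days_since_launch"] = (df["date"] - df["product_launch_date"]).dt.days

# Assumption: edges at 0, 30, 60, ... so buckets are aligned to the launch date.
# The stop is max + 31 so that the last edge is at or beyond the maximum.
bins = np.arange(0, df["days_since_launch"].max() + 31, 30)

# labels=False returns 0-based bin indices; add 1 to get buckets 1, 2, 3, ...
# include_lowest=True puts day 0 into the first bucket instead of NaN.
df["date_bucket"] = pd.cut(
    df["days_since_launch"], bins=bins, labels=False, include_lowest=True
) + 1

# Cross-check with arithmetic: days 0-30 -> 1, 31-60 -> 2, 61-90 -> 3, ...
df["bucket_arith"] = (
    np.ceil(df["days_since_launch"] / 30).clip(lower=1).astype(int)
)

# Total revenue per product and 30-day bucket.
result = df.groupby(["product", "date_bucket"], as_index=False)["revenue"].sum()
print(result)
```

If the intervals should be half-open on the other side (days 0-29, 30-59, ...), integer division such as df["days_since_launch"] // 30 would be a simpler alternative to pd.cut.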