A small sample of my data:
import pandas as pd
from pandas import Timestamp

# Assign the sample to df so the later snippets can use it
df = pd.DataFrame({'date': {0: Timestamp('2021-08-01 00:00:00'),
                            1: Timestamp('2022-08-01 00:00:00'),
                            2: Timestamp('2021-08-01 00:00:00'),
                            3: Timestamp('2021-08-01 00:00:00'),
                            4: Timestamp('2022-08-01 00:00:00'),
                            5: Timestamp('2022-08-01 00:00:00')},
                   'customer_nr': {0: 2, 1: 3, 2: 2, 3: 3, 4: 2, 5: 2},
                   'product_nr': {0: 3, 1: 2, 2: 2, 3: 1, 4: 2, 5: 1},
                   'age': {0: 32.0, 1: 32.0, 2: 32.0, 3: 32.0, 4: 32.0, 5: 37.0},
                   'gender': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'M', 5: 'M'},
                   'age_group': {0: '25-34',
                                 1: '25-34',
                                 2: '25-34',
                                 3: '25-34',
                                 4: '25-34',
                                 5: '35-44'}})
Then I want to regroup it as follows:
df.groupby(['date','product_nr','age_group']).age.count().unstack()
It looks like this (stored as t for the next step):
t = df.groupby(['date','product_nr','age_group']).age.count().unstack()
Apply the percentage change within each product_nr group (this code also works if you have more than two dates):
parts = []
# grp avoids shadowing the original df inside the loop
for _, grp in t.groupby('product_nr'):
    # percentage change relative to the previous date of the same product
    parts.append(((grp / grp.shift(1)) - 1) * 100)
output = pd.concat(parts).reset_index()
Output:
age_group date product_nr 25-34 35-44
0 2021-08-01 1 NaN NaN
1 2022-08-01 1 NaN NaN
2 2021-08-01 2 NaN NaN
3 2022-08-01 2 100.0 NaN
4 2021-08-01 3 NaN NaN
Get the output for the desired date:
output[output['date'] == '2022-08-01']
Final output:
age_group date product_nr 25-34 35-44
1 2022-08-01 1 NaN NaN
3 2022-08-01 2 100.0 NaN
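The per-product loop can probably be replaced by a grouped shift. This is only a sketch under the assumption that the goal is the change relative to the previous date of the same product; the names prev, pct and result are illustrative and not from the original post, and the sample data is rebuilt in a shorter form so the snippet runs on its own.

```python
import pandas as pd

# Same sample data as in the question, written compactly
df = pd.DataFrame({
    'date': pd.to_datetime(['2021-08-01', '2022-08-01', '2021-08-01',
                            '2021-08-01', '2022-08-01', '2022-08-01']),
    'customer_nr': [2, 3, 2, 3, 2, 2],
    'product_nr': [3, 2, 2, 1, 2, 1],
    'age': [32.0, 32.0, 32.0, 32.0, 32.0, 37.0],
    'gender': ['M'] * 6,
    'age_group': ['25-34'] * 5 + ['35-44'],
})

# Counts per (date, product_nr), one column per age_group
t = df.groupby(['date', 'product_nr', 'age_group']).age.count().unstack()

# Previous date's counts for the same product_nr (NaN where there is none)
prev = t.groupby(level='product_nr').shift(1)

# Percentage change, then keep only the date of interest
pct = ((t / prev - 1) * 100).reset_index()
result = pct[pct['date'] == '2022-08-01']
print(result)
```

Unlike the loop, this keeps the rows in the original (date, product_nr) order rather than grouping them by product.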
You can save the regrouped df and fill its NaN values with the fillna method.
df_group = df.groupby(['date','product_nr','age_group']).age.count().unstack().fillna(0)
Then you can save the 2022 and 2021 data in new variables:
df_2022 = df_group.loc["2022-08-01"]
df_2021 = df_group.loc["2021-08-01"]
Then subtract them and divide by the original (2021) values to get the percentage difference.
(df_2022 - df_2021).divide(df_2021)
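One thing to watch with this approach: after fillna(0), a group that had no rows in 2021 has a baseline of 0, so the division yields inf (or NaN for 0/0). A minimal sketch of one way to handle that, assuming such cases should simply be reported as missing; using xs instead of loc, multiplying by 100, and the name pct_diff are choices made here, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question, written compactly
df = pd.DataFrame({
    'date': pd.to_datetime(['2021-08-01', '2022-08-01', '2021-08-01',
                            '2021-08-01', '2022-08-01', '2022-08-01']),
    'customer_nr': [2, 3, 2, 3, 2, 2],
    'product_nr': [3, 2, 2, 1, 2, 1],
    'age': [32.0, 32.0, 32.0, 32.0, 32.0, 37.0],
    'gender': ['M'] * 6,
    'age_group': ['25-34'] * 5 + ['35-44'],
})

df_group = (df.groupby(['date', 'product_nr', 'age_group'])
              .age.count().unstack().fillna(0))

# Select each year explicitly by Timestamp; the result is indexed by product_nr
df_2022 = df_group.xs(pd.Timestamp('2022-08-01'), level='date')
df_2021 = df_group.xs(pd.Timestamp('2021-08-01'), level='date')

# Relative difference in percent; products missing in one year become NaN
pct_diff = (df_2022 - df_2021).divide(df_2021) * 100

# A zero baseline produces +/-inf; treat it as "not defined"
pct_diff = pct_diff.replace([np.inf, -np.inf], np.nan)
print(pct_diff)
```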
Use GroupBy.apply and Series.pct_change:
# columns and 'age.count' are placeholders for the grouping column(s) and the count column
df['pct_ch'] = (df.groupby(columns)['age.count'].apply(pd.Series.pct_change) + 1)
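Applied to the sample data, the same idea might look like the sketch below. It assumes the intent is a per-group percentage change of the counts, and it swaps apply(pd.Series.pct_change) for the built-in SeriesGroupBy.pct_change, which keeps the result aligned with the original index. The names counts and pct_ch are illustrative.

```python
import pandas as pd

# Same sample data as in the question, written compactly
df = pd.DataFrame({
    'date': pd.to_datetime(['2021-08-01', '2022-08-01', '2021-08-01',
                            '2021-08-01', '2022-08-01', '2022-08-01']),
    'customer_nr': [2, 3, 2, 3, 2, 2],
    'product_nr': [3, 2, 2, 1, 2, 1],
    'age': [32.0, 32.0, 32.0, 32.0, 32.0, 37.0],
    'gender': ['M'] * 6,
    'age_group': ['25-34'] * 5 + ['35-44'],
})

# Long-form counts: one value per (date, product_nr, age_group), sorted by date
counts = df.groupby(['date', 'product_nr', 'age_group']).age.count()

# Fractional change between consecutive dates within each
# (product_nr, age_group) pair; the first date of each pair is NaN
pct_ch = counts.groupby(level=['product_nr', 'age_group']).pct_change()
print(pct_ch)
```

Here the values are fractions (1.0 means +100%); multiply by 100 for percentages, or add 1 to get growth factors as in the line above.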