How to count the active dates in each month or within a range of months



I have a DataFrame like this:

Student_id actvity_timestamp
1001 2019-09-05 08:26:12
1001 2019-09-06 09:26:12
1001 2019-09-21 10:11:01
1001 2019-10-24 11:44:01
1001 2019-10-25 11:31:01
1001 2019-10-26 12:13:01
1002 2019-09-11 12:21:01
1002 2019-09-12 13:11:01
1002 2019-11-23 16:22:01

You could try something like this:

import pandas as pd

df = pd.DataFrame.from_dict({
    "Student_id": [1001,1001,1001,1001,1001,1001,1002,1002,1002],
    "actvity_timestamp": ["2019-09-05 08:26:12", "2019-09-06 09:26:12", "2019-09-21 10:11:01", "2019-10-24 11:44:01", "2019-10-25 11:31:01", "2019-10-26 12:13:01", "2019-09-11 12:21:01", "2019-09-12 13:11:01", "2019-11-23 16:22:01"]
})
timestamps = pd.to_datetime(df.actvity_timestamp)  # parse the strings once; .dt needs datetimes
months = timestamps.dt.strftime("%B")
result = pd.crosstab(
    df.Student_id,
    months,
    # values + aggfunc: if a Student_id was active more than once on a single day,
    # that day is counted only once. (Thanks to @tlentali)
    values=timestamps.dt.date,
    aggfunc=pd.Series.nunique
).fillna(0).astype(int)  # missing student/month pairs become 0; cast back to whole numbers

Series.dt.strftime works on a datetime Series, and %B formats each datetime so that only the full month name is shown.
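As a small illustration of the formatting step described above, the sketch below compares a few ways to get a month label from a datetime Series. The variable names are made up for this example. Note that %B and %b depend on the active locale, while Series.dt.month_name() returns English names by default.

```python
import pandas as pd

# Two sample timestamps taken from the question's data.
ts = pd.Series(pd.to_datetime(["2019-09-05 08:26:12", "2019-11-23 16:22:01"]))

full_names = ts.dt.strftime("%B")    # full month name (locale-dependent)
short_names = ts.dt.strftime("%b")   # abbreviated month name (locale-dependent)
english_names = ts.dt.month_name()   # English month names by default
month_numbers = ts.dt.month          # integers 1-12, convenient for sorting

print(english_names.tolist())
print(month_numbers.tolist())
```

Month numbers sort in calendar order, whereas month names sort alphabetically, which is why the columns in the result come out as November, October, September.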

result will then be:

actvity_timestamp  November  October  September
Student_id                                     
1001                      0        3          3
1002                      1        0          2

You can get the desired layout (column names sorted in calendar order, "Sep" -> "Oct" -> "Nov", rather than "Nov" -> "Oct" -> "Sep") with the following steps:

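1) Create a column of month short names. Then use .pivot_table() to reshape the DataFrame, aggregating the count of unique active dates per month under each Student_id: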

df['actvity_timestamp'] = pd.to_datetime(df['actvity_timestamp'])   # convert to datetime
df['activity_month'] = df['actvity_timestamp'].dt.strftime('%b')    # month short name
df['activity_date'] = df['actvity_timestamp'].dt.date               # activity date
df_out = (df.pivot_table(index='Student_id',        # group under each student id
                         columns='activity_month',  # month short names become the new columns
                         values='activity_date',    # aggregate on dates
                         aggfunc='nunique',         # activities on the same date are counted once
                         fill_value=0)
            .rename_axis(columns=None)
         )

Nov  Oct  Sep
Student_id               
1001          0    3    3
1002          1    0    2

2) Sort the month short-name columns into calendar order using .sort_index() with its key parameter, as follows:

df_out = df_out.sort_index(axis=1, key=lambda x: pd.to_datetime(x, format='%b').month)

Sep  Oct  Nov
Student_id               
1001          3    3    0
1002          2    0    1
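If the key parameter of .sort_index() is not available (it was added in pandas 1.1), a possible alternative is to select the columns in a fixed calendar order. This is only a sketch: MONTH_ORDER is a hypothetical helper list introduced here, and the small frame is rebuilt by hand from the pivot output shown in step 1.

```python
import pandas as pd

# Hypothetical helper: month short names in calendar order (English).
MONTH_ORDER = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Same values as the pivot result from step 1, with columns in alphabetical order.
df_out = pd.DataFrame(
    {"Nov": [0, 1], "Oct": [3, 0], "Sep": [3, 2]},
    index=pd.Index([1001, 1002], name="Student_id"),
)

# Keep only the months that actually occur, in calendar order.
ordered_columns = [m for m in MONTH_ORDER if m in df_out.columns]
df_sorted = df_out[ordered_columns]
print(df_sorted)
```

Unlike parsing the names with pd.to_datetime(..., format='%b'), this does not depend on the locale used for month names, but it does assume the columns are English abbreviations.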

3) Transform further into the desired layout with .add_prefix():

df_out = df_out.add_prefix('total_active_days_in_').reset_index()

Result:

print(df_out)
Student_id  total_active_days_in_Sep  total_active_days_in_Oct  total_active_days_in_Nov
0        1001                         3                         3                         0
1        1002                         2                         0                         1
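The question title also mentions a range of months. Month names alone merge the same month from different years, so for data spanning more than one year a year-month key may be safer. The following is a sketch under assumptions: the sample rows, the range bounds, and the year_month column name are invented for illustration and are not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    "Student_id": [1001, 1001, 1001, 1002, 1002, 1002],
    "activity_timestamp": pd.to_datetime([
        "2019-09-05 08:26:12",
        "2019-09-05 17:00:00",  # same day as the row above
        "2019-10-24 11:44:01",
        "2019-09-11 12:21:01",
        "2020-09-12 13:11:01",  # September of a different year
        "2020-11-02 09:00:00",  # falls outside the range chosen below
    ]),
})

# Keep only activity inside an illustrative range: [2019-09-01, 2020-10-01).
start, end = pd.Timestamp("2019-09-01"), pd.Timestamp("2020-10-01")
mask = (df["activity_timestamp"] >= start) & (df["activity_timestamp"] < end)
in_range = df[mask].copy()

# "YYYY-MM" keys are numeric, so they sort chronologically and keep years apart.
in_range["year_month"] = in_range["activity_timestamp"].dt.strftime("%Y-%m")
in_range["date"] = in_range["activity_timestamp"].dt.date

counts = (in_range.groupby(["Student_id", "year_month"])["date"]
                  .nunique()               # each calendar day counts once
                  .unstack(fill_value=0))  # months without activity become 0
print(counts)
```

Because the keys look like "2019-09", the columns already come out in chronological order and no separate sorting step is needed.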

Starting with this DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame({'Student_id': [1001, 1001, 1001, 1001, 1001, 1001, 1002, 1002, 1002],
...                    'activity_timestamp': ['2019-09-05 08:26:12', '2019-09-06 09:26:12', '2019-09-21 10:11:01', '2019-10-24 11:44:01', '2019-10-25 11:31:01', '2019-10-26 12:13:01', '2019-09-11 12:21:01', '2019-09-12 13:11:01', '2019-11-23 16:22:01']}, 
...                   index = [0, 1, 2, 3, 4, 5, 6, 7, 8])
>>> df
Student_id  activity_timestamp
0   1001        2019-09-05 08:26:12
1   1001        2019-09-06 09:26:12
2   1001        2019-09-21 10:11:01
3   1001        2019-10-24 11:44:01
4   1001        2019-10-25 11:31:01
5   1001        2019-10-26 12:13:01
6   1002        2019-09-11 12:21:01
7   1002        2019-09-12 13:11:01
8   1002        2019-11-23 16:22:01

We convert activity_timestamp to datetime and extract the date and the month name, as follows:

>>> df['activity_timestamp'] = pd.to_datetime(df['activity_timestamp'], format='%Y-%m-%d %H:%M:%S')
>>> df['date'] = df['activity_timestamp'].dt.date
>>> df['month'] = df['activity_timestamp'].dt.month_name()
>>> df
Student_id  activity_timestamp  date        month
0   1001        2019-09-05 08:26:12 2019-09-05  September
1   1001        2019-09-06 09:26:12 2019-09-06  September
2   1001        2019-09-21 10:11:01 2019-09-21  September
3   1001        2019-10-24 11:44:01 2019-10-24  October
4   1001        2019-10-25 11:31:01 2019-10-25  October
5   1001        2019-10-26 12:13:01 2019-10-26  October
6   1002        2019-09-11 12:21:01 2019-09-11  September
7   1002        2019-09-12 13:11:01 2019-09-12  September
8   1002        2019-11-23 16:22:01 2019-11-23  November

Then we use the pivot_table() method with the nunique function, rather than count, to get the number of unique dates:

>>> df_result = (df.pivot_table(index='Student_id', 
...                             columns='month', 
...                             values='date', 
...                             aggfunc=pd.Series.nunique, 
...                             fill_value=0).rename_axis(columns=None)).add_prefix('total_active_days_in_').reset_index(drop=False)
>>> df_result
Student_id  total_active_days_in_November   total_active_days_in_October    total_active_days_in_September
0   1001        0                               3                               3
1   1002        1                               0                               2

Thanks to @SeaBean for the add_prefix method.
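To make the difference between count and nunique concrete, here is a minimal sketch with one student who has two activities on the same day. The frame and variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Student_id": [1001, 1001, 1001],
    "activity_timestamp": pd.to_datetime([
        "2019-09-05 08:26:12",
        "2019-09-05 08:26:13",  # second activity on the same day
        "2019-09-06 09:26:12",
    ]),
})
df["date"] = df["activity_timestamp"].dt.date

# count tallies every activity row; nunique tallies distinct calendar days.
events = df.groupby("Student_id")["date"].count()
active_days = df.groupby("Student_id")["date"].nunique()

print(events.loc[1001], active_days.loc[1001])
```

Here count reports 3 activities while nunique reports 2 active days, which is the quantity the question asks for.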
