我正在做一个在线课程中分析学生点击行为的项目,我们将点击路径视为顺序数据,它看起来像:
user_id timestamp duration_sec Page
545301 8/25/2020 14:49 5 home
545301 8/25/2020 15:00 10 instructor
545301 9/2/2020 13:33 5 home
545301 9/8/2020 12:46 3 home
545301 9/9/2020 11:10 3 home
545301 9/9/2020 13:24 8 general
545301 9/9/2020 14:33 12 zoom
我想做的是在子系列之间添加行作为分隔符,以指示学生在两个行为系列之间休息(即。两次点击之间的时间间隔很长,比如超过3个小时)。期望的数据应该是这样的:
user_id timestamp duration_sec Page
545301 8/25/2020 14:49 5 home
545301 8/25/2020 15:00 10 instructor
545301 8/25/2020 15:10 99999 break
545301 9/2/2020 13:33 5 home
545301 9/2/2020 13:38 99999 break
545301 9/8/2020 12:46 3 home
545301 9/8/2020 12:49 99999 break
545301 9/9/2020 11:10 3 home
545301 9/9/2020 13:24 8 general
545301 9/9/2020 14:33 12 zoom
如果有人能给我一些提示,我将非常感激。
更新在@Abdel的帮助下,我意识到了这一点:首先,添加一个名为
的时间(小时)。gap
的列,它表示距离下一次单击
df['gap']= (df['start_time_shift'] - df['start_time']).dt.total_seconds()/3600
第二,定义一个添加断点的函数:
def add_breaks(df):
List_of_breaks=[]
for i in range(len(df.index)):
if df.gap.iloc[i] > 3:
List_of_breaks.append(i)
for i in List_of_breaks:
line = pd.DataFrame({'course_id':df.course_id.iloc[i],
"user_id": df.user_id.iloc[i],
"start_time": df.end_time.iloc[i],
'end_time':df.start_time_shift.iloc[i],
'start_time_shift':df.start_time_shift.iloc[i+1],
'duration_min':df.start_time_shift.iloc[i]-df.end_time.iloc[i],
'gap': 'gap',
'page':'break}, index=[i+1])
df=pd.concat([df.iloc[:i+1], line, df.iloc[i+2:]]).reset_index(drop=True)
return(df)
我修改了@Abdel的答案,1)将过滤条件更改为gap的值。2)加行方式,应为df.iloc[:i+1], line, df.iloc[i+2:]]).reset_index(drop=True)
这是我的答案(对问题的第一个版本,因为一些数据确实发生了变化):
首先,我确实构建了一些类似于您的数据框架的东西
import pandas as pd
from io import StringIO
import re
f= re.sub('s+',',',re.sub('n','..',"""user_id timestamp duration_min Page
545301 8/25/2020_14:49 8.600000 home
545301 8/25/2020_15:00 10.100000 instructor
545301 9/2/2020_13:33 49.700000 home
545301 9/8/2020_12:46 223.783333 home
545301 9/9/2020_11:10 7.633333 home
545301 9/9/2020_13:24 69.300000 general
545301 9/9/2020_14:33 2651.133333 zoom
""")).replace('..','n')
f = StringIO(f)
df= pd.read_csv(f)
df['timestamp']=df['timestamp'].str.replace('_',' ',regex=False)
df.drop(['Unnamed: 4'],axis=1, inplace=True)
df.timestamp = pd.to_datetime(df.timestamp)
print(df)
然后我们需要找到我们必须插入行的索引:
List_of_home_indexes=[]
for i in range(len(df.index)):
if df.Page.iloc[i] =='home': List_of_home_indexes.append(i)
print(List_of_home_indexes)
然后在定义行之后插入该行,通过对找到的索引执行循环:
from datetime import timedelta
for i in List_of_home_indexes:
line = pd.DataFrame({"user_id": 545301, "timestamp": df.timestamp.iloc[i]+ timedelta(seconds=1), 'duration_min':99999, 'Page':'break'}, index=[i+1])
df=pd.concat([df.iloc[:i+1], line, df.iloc[i+1:]]).reset_index(drop=True)
print(df)
你得到了你想要的结果。
对于这个解决方案,我假设user_id不是索引。如果是,请在启动前重置索引。
首先定义idle_time"通过获取时间戳之间的差异,并考虑到duration_sec(我们需要首先将其从数字转换为时间增量):
df['idle_time'] = df.timestamp.diff().shift(-1) - pd.to_timedelta(df.duration_sec, unit='s')
user_id timestamp duration_sec Page idle_time
0 545301 2020-08-25 14:49:00 5 home 0 days 00:10:55
1 545301 2020-08-25 15:00:00 10 instructor 7 days 22:32:50
2 545301 2020-09-02 13:33:00 5 home 5 days 23:12:55
3 545301 2020-09-08 12:46:00 3 home 0 days 22:23:57
4 545301 2020-09-09 11:10:00 3 home 0 days 02:13:57
5 545301 2020-09-09 13:24:00 8 general 0 days 01:08:52
6 545301 2020-09-09 14:33:00 12 zoom NaT
然后我们获取学生休息之前的行,在本例中,我定义为超过6小时的idle_time(但您可以将其更改为任何您想要的):
pre_breaks = df[df.idle_time > pd.to_timedelta(6, unit='h')]
user_id timestamp duration_sec Page idle_time
1 545301 2020-08-25 15:00:00 10 instructor 7 days 22:32:50
2 545301 2020-09-02 13:33:00 5 home 5 days 23:12:55
3 545301 2020-09-08 12:46:00 3 home 0 days 22:23:57
然后将这些行修改为换行,如下所示:
pre_breaks['timestamp'] = pre_breaks.timestamp +
pd.to_timedelta(pre_breaks.duration_sec, 's')
pre_breaks['Page'] = 'break'
pre_breaks['duration_sec'] = pre_breaks.idle_time.apply(lambda x:x.seconds)
user_id timestamp duration_sec Page idle_time
1 545301 2020-08-25 15:00:10 81170 break NaN
2 545301 2020-09-02 13:33:05 83575 break NaN
3 545301 2020-09-08 12:46:03 80637 break NaN
然后我们将它们插入到学生休息的事件对应的索引中:
for i in pre_breaks.index:
df.loc[i+0.5] = pre_breaks.loc[i]
user_id timestamp duration_sec Page idle_time
0.0 545301 2020-08-25 14:49:00 5 home 0 days 00:10:55
1.0 545301 2020-08-25 15:00:00 10 instructor 7 days 22:32:50
2.0 545301 2020-09-02 13:33:00 5 home 5 days 23:12:55
3.0 545301 2020-09-08 12:46:00 3 home 0 days 22:23:57
4.0 545301 2020-09-09 11:10:00 3 home 0 days 02:13:57
5.0 545301 2020-09-09 13:24:00 8 general 0 days 01:08:52
6.0 545301 2020-09-09 14:33:00 12 zoom NaT
1.5 545301 2020-08-25 15:00:10 81170 break NaT
2.5 545301 2020-09-02 13:33:05 83575 break NaT
3.5 545301 2020-09-08 12:46:03 80637 break NaT
最后,对索引进行排序并重置。我们还删除了idle_time列(可选):
df = df.sort_index().reset_index(drop=True).drop(columns='idle_time')
最终结果:
user_id timestamp duration_sec Page
0 545301 2020-08-25 14:49:00 5 home
1 545301 2020-08-25 15:00:00 10 instructor
2 545301 2020-08-25 15:00:10 81170 break
3 545301 2020-09-02 13:33:00 5 home
4 545301 2020-09-02 13:33:05 83575 break
5 545301 2020-09-08 12:46:00 3 home
6 545301 2020-09-08 12:46:03 80637 break
7 545301 2020-09-09 11:10:00 3 home
8 545301 2020-09-09 13:24:00 8 general
9 545301 2020-09-09 14:33:00 12 zoom