在时间序列数据中插入行作为分隔符



我正在做一个在线课程中分析学生点击行为的项目,我们将点击路径视为顺序数据,它看起来像:

user_id timestamp          duration_sec     Page        
545301  8/25/2020 14:49    5                home        
545301  8/25/2020 15:00    10               instructor  
545301  9/2/2020  13:33    5                home       
545301  9/8/2020  12:46    3                home        
545301  9/9/2020  11:10    3                home       
545301  9/9/2020  13:24    8                general    
545301  9/9/2020  14:33    12               zoom    

我想做的是在子系列之间添加行作为分隔符,以指示学生在两个行为系列之间休息(即。两次点击之间的时间间隔很长,比如超过3个小时)。期望的数据应该是这样的:

user_id timestamp          duration_sec     Page        
545301  8/25/2020 14:49    5                home        
545301  8/25/2020 15:00    10               instructor
545301  8/25/2020 15:10    99999            break
545301  9/2/2020  13:33    5                home    
545301  9/2/2020  13:38    99999            break
545301  9/8/2020  12:46    3                home
545301  9/8/2020  12:49    99999            break       
545301  9/9/2020  11:10    3                home       
545301  9/9/2020  13:24    8                general    
545301  9/9/2020  14:33    12               zoom    

如果有人能给我一些提示,我将非常感激。

更新在@Abdel的帮助下,我意识到了这一点:首先,添加一个名为gap的列,它表示距离下一次单击

的时间(小时)。
df['gap']= (df['start_time_shift'] - df['start_time']).dt.total_seconds()/3600

第二,定义一个添加断点的函数:

def add_breaks(df):
List_of_breaks=[]
for i in range(len(df.index)):
if df.gap.iloc[i] > 3: 
List_of_breaks.append(i)

for i in List_of_breaks:
line = pd.DataFrame({'course_id':df.course_id.iloc[i],
"user_id": df.user_id.iloc[i], 
"start_time": df.end_time.iloc[i], 
'end_time':df.start_time_shift.iloc[i],
'start_time_shift':df.start_time_shift.iloc[i+1],
'duration_min':df.start_time_shift.iloc[i]-df.end_time.iloc[i],
'gap': 'gap',
'page':'break}, index=[i+1])
df=pd.concat([df.iloc[:i+1], line, df.iloc[i+2:]]).reset_index(drop=True)                 

return(df)

我修改了@Abdel的答案,1)将过滤条件更改为gap的值。2)加行方式,应为df.iloc[:i+1], line, df.iloc[i+2:]]).reset_index(drop=True)

这是我的答案(对问题的第一个版本,因为一些数据确实发生了变化):

首先,我确实构建了一些类似于您的数据框架的东西

import pandas as pd
from io import StringIO
import re
f= re.sub('s+',',',re.sub('n','..',"""user_id timestamp duration_min  Page        
545301  8/25/2020_14:49    8.600000         home        
545301  8/25/2020_15:00    10.100000        instructor  
545301  9/2/2020_13:33    49.700000        home       
545301  9/8/2020_12:46    223.783333       home        
545301  9/9/2020_11:10    7.633333         home       
545301  9/9/2020_13:24    69.300000        general    
545301  9/9/2020_14:33    2651.133333      zoom 
""")).replace('..','n')
f = StringIO(f)
df= pd.read_csv(f)
df['timestamp']=df['timestamp'].str.replace('_',' ',regex=False)
df.drop(['Unnamed: 4'],axis=1, inplace=True)
df.timestamp = pd.to_datetime(df.timestamp)
print(df)

然后我们需要找到我们必须插入行的索引:

List_of_home_indexes=[]
for i in range(len(df.index)):
if df.Page.iloc[i] =='home': List_of_home_indexes.append(i)
print(List_of_home_indexes)

然后在定义行之后插入该行,通过对找到的索引执行循环:

from datetime import timedelta
for i in List_of_home_indexes:
line = pd.DataFrame({"user_id": 545301, "timestamp": df.timestamp.iloc[i]+ timedelta(seconds=1), 'duration_min':99999, 'Page':'break'}, index=[i+1])
df=pd.concat([df.iloc[:i+1], line, df.iloc[i+1:]]).reset_index(drop=True)
print(df)

你得到了你想要的结果。

对于这个解决方案,我假设user_id不是索引。如果是,请在启动前重置索引。

首先定义idle_time"通过获取时间戳之间的差异,并考虑到duration_sec(我们需要首先将其从数字转换为时间增量):

df['idle_time'] = df.timestamp.diff().shift(-1) - pd.to_timedelta(df.duration_sec, unit='s')
user_id           timestamp  duration_sec        Page       idle_time
0   545301 2020-08-25 14:49:00             5        home 0 days 00:10:55
1   545301 2020-08-25 15:00:00            10  instructor 7 days 22:32:50
2   545301 2020-09-02 13:33:00             5        home 5 days 23:12:55
3   545301 2020-09-08 12:46:00             3        home 0 days 22:23:57
4   545301 2020-09-09 11:10:00             3        home 0 days 02:13:57
5   545301 2020-09-09 13:24:00             8     general 0 days 01:08:52
6   545301 2020-09-09 14:33:00            12        zoom             NaT

然后我们获取学生休息之前的行,在本例中,我定义为超过6小时的idle_time(但您可以将其更改为任何您想要的):

pre_breaks = df[df.idle_time > pd.to_timedelta(6, unit='h')]
user_id           timestamp  duration_sec        Page       idle_time
1   545301 2020-08-25 15:00:00            10  instructor 7 days 22:32:50
2   545301 2020-09-02 13:33:00             5        home 5 days 23:12:55
3   545301 2020-09-08 12:46:00             3        home 0 days 22:23:57

然后将这些行修改为换行,如下所示:

pre_breaks['timestamp'] = pre_breaks.timestamp + 
pd.to_timedelta(pre_breaks.duration_sec, 's')
pre_breaks['Page'] = 'break'
pre_breaks['duration_sec'] = pre_breaks.idle_time.apply(lambda x:x.seconds)
user_id           timestamp  duration_sec   Page  idle_time
1   545301 2020-08-25 15:00:10         81170  break        NaN
2   545301 2020-09-02 13:33:05         83575  break        NaN
3   545301 2020-09-08 12:46:03         80637  break        NaN

然后我们将它们插入到学生休息的事件对应的索引中:

for i in pre_breaks.index:
df.loc[i+0.5] = pre_breaks.loc[i]
user_id           timestamp  duration_sec        Page       idle_time
0.0   545301 2020-08-25 14:49:00             5        home 0 days 00:10:55
1.0   545301 2020-08-25 15:00:00            10  instructor 7 days 22:32:50
2.0   545301 2020-09-02 13:33:00             5        home 5 days 23:12:55
3.0   545301 2020-09-08 12:46:00             3        home 0 days 22:23:57
4.0   545301 2020-09-09 11:10:00             3        home 0 days 02:13:57
5.0   545301 2020-09-09 13:24:00             8     general 0 days 01:08:52
6.0   545301 2020-09-09 14:33:00            12        zoom             NaT
1.5   545301 2020-08-25 15:00:10         81170       break             NaT
2.5   545301 2020-09-02 13:33:05         83575       break             NaT
3.5   545301 2020-09-08 12:46:03         80637       break             NaT

最后,对索引进行排序并重置。我们还删除了idle_time列(可选):

df = df.sort_index().reset_index(drop=True).drop(columns='idle_time')

最终结果:

user_id           timestamp  duration_sec        Page
0   545301 2020-08-25 14:49:00             5        home
1   545301 2020-08-25 15:00:00            10  instructor
2   545301 2020-08-25 15:00:10         81170       break
3   545301 2020-09-02 13:33:00             5        home
4   545301 2020-09-02 13:33:05         83575       break
5   545301 2020-09-08 12:46:00             3        home
6   545301 2020-09-08 12:46:03         80637       break
7   545301 2020-09-09 11:10:00             3        home
8   545301 2020-09-09 13:24:00             8     general
9   545301 2020-09-09 14:33:00            12        zoom

最新更新