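A row's source should be 'visit' if, for the row above it, (1) the company is the same company and (2) the type is 'home'. The dataframe is sorted. But relying on the previous row means that if there are rows in between, the visit is not classified: here, row 4 is classified as a visit, but row 2 is also a visit and is missed. How can I classify these visits as long as the time difference is within 5 minutes?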
   source  datetime   location  type  start  company
0             10:00     london  home      1    apple
1             10:03    unknown                 tesla
2             10:04     France                 apple
3             10:05  Melbourne  home      1    apple
4   visit     10:06     France                 apple
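For reference, the previous-row rule described above can be sketched roughly as follows. This is only an illustration: it assumes the blank cells in the table are empty strings, and the names `prev_same_company`, `prev_is_home` and `flagged` are made up for the example.

```python
import pandas as pd

# Sample frame from the question; blank cells are assumed to be empty strings.
df = pd.DataFrame({
    'source':   ['', '', '', '', 'visit'],
    'datetime': ['10:00', '10:03', '10:04', '10:05', '10:06'],
    'location': ['london', 'unknown', 'France', 'Melbourne', 'France'],
    'type':     ['home', '', '', 'home', ''],
    'start':    [1, None, None, 1, None],
    'company':  ['apple', 'tesla', 'apple', 'apple', 'apple'],
})

# Condition (1): the row directly above belongs to the same company.
prev_same_company = df['company'].shift().eq(df['company'])
# Condition (2): the row directly above has type 'home'.
prev_is_home = df['type'].shift().eq('home')

# Rows that the previous-row rule would classify as a visit.
flagged = prev_same_company & prev_is_home
print(flagged.tolist())
```

Only row 4 is flagged, because the tesla row at 10:03 sits between the apple 'home' row at 10:00 and the apple row at 10:04.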
10:04 is within 5 minutes of 10:00, so row 2 should be a visit. It also meets both conditions for a visit. Expected output:
   source  datetime   location  type  start  company
0             10:00     london  home      1    apple
1             10:03    unknown                 tesla
2   visit     10:04     France                 apple
3             10:05  Melbourne  home      1    apple
4   visit     10:06     France                 apple
Here is one way to do it:
import numpy as np
import pandas as pd

# create a reference time: the datetime of each row whose type is 'home'
df['ref_date'] = df['datetime'].where(df['type'].str.strip().eq('home'))
# forward-fill the reference time within each company
df['ref_date'] = df.groupby('company')['ref_date'].ffill()
# minutes elapsed since the most recent 'home' row of the same company
# (NaN when there is no earlier 'home' row, so the comparison below is False)
elapsed = (pd.to_datetime(df['datetime'])
           .sub(pd.to_datetime(df['ref_date']))
           .dt.total_seconds() / 60)
# use np.where to populate the source where the row is not the 'home' row itself
# and the time difference is 5 minutes or less
df['source'] = np.where((df['datetime'] != df['ref_date']) & (elapsed <= 5),
                        'visit', df['source'])
df = df.drop(columns='ref_date')
df
   source  datetime   location  type  start  company
0             10:00     london  home    1.0    apple
1             10:03    unknown                 tesla
2   visit     10:04     France                 apple
3             10:05  Melbourne  home    1.0    apple
4   visit     10:06     France                 apple
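As an alternative to the forward-fill approach, the same lookup ("most recent 'home' row of the same company within 5 minutes") can probably also be expressed with `pd.merge_asof`. The sketch below is not part of the answer above. It assumes the frame is sorted by time, that `datetime` holds 'HH:MM' strings, and that blank cells are empty strings; `classify_visits_asof`, `ts` and `home_ts` are hypothetical names used only for illustration.

```python
import pandas as pd

# Sample frame from the question; blank cells are assumed to be empty strings.
df = pd.DataFrame({
    'source':   ['', '', '', '', 'visit'],
    'datetime': ['10:00', '10:03', '10:04', '10:05', '10:06'],
    'location': ['london', 'unknown', 'France', 'Melbourne', 'France'],
    'type':     ['home', '', '', 'home', ''],
    'start':    [1, None, None, 1, None],
    'company':  ['apple', 'tesla', 'apple', 'apple', 'apple'],
})

def classify_visits_asof(frame, window='5min'):
    """Hypothetical helper: mark rows that follow a same-company 'home' row within `window`."""
    out = frame.copy()
    # Assumption: times are 'HH:MM' strings; merge_asof needs a sorted, real time key.
    out['ts'] = pd.to_datetime(out['datetime'], format='%H:%M')
    is_home = out['type'].str.strip().eq('home')
    # Right-hand table: one row per 'home' event, keeping its time as home_ts.
    homes = out.loc[is_home, ['ts', 'company']].assign(home_ts=lambda d: d['ts'])
    # For each row, find the latest strictly earlier 'home' row of the same company
    # that lies within the tolerance window.
    matched = pd.merge_asof(
        out[['ts', 'company']], homes,
        on='ts', by='company',
        direction='backward',
        tolerance=pd.Timedelta(window),
        allow_exact_matches=False,
    )
    # A 'home' row is never a visit, even if an earlier 'home' row is within the window.
    is_visit = matched['home_ts'].notna().to_numpy() & ~is_home.to_numpy()
    out.loc[is_visit, 'source'] = 'visit'
    return out.drop(columns='ts')

result = classify_visits_asof(df)
print(result['source'].tolist())
```

On the sample data this marks rows 2 and 4. The `by='company'` argument plays the role of the `groupby` in the answer, and `tolerance` replaces the explicit minutes comparison.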