Computing a value from the previous value in another column



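A row's source is "visit" if, for the row above it, (1) the company is the same and (2) the type is home. The dataframe is sorted. But relying on the row directly above means that a visit is not classified when other rows sit in between: in the example below, row 4 is marked as a visit, but row 2 is not, because row 1 comes between it and the home row. How can I classify these visits as long as the time difference is within 5 minutes?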

source datetime location  type  start  company 
0          10:00    london    home  1       apple
1          10:03    unknown                 tesla
2          10:04    France                  apple
3          10:05    Melbourne home  1       apple
4    visit 10:06    France                  apple
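
To make the limitation concrete, here is a minimal sketch of the row-above rule applied to the sample data. The frame is rebuilt by hand from the table above, with empty strings standing in for blank cells (an assumption about how the real data stores them), and the helper names `prev_same_company` and `prev_is_home` are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the table above; "" stands in for blank cells (assumption).
df = pd.DataFrame({
    "source": ["", "", "", "", ""],
    "datetime": ["10:00", "10:03", "10:04", "10:05", "10:06"],
    "location": ["london", "unknown", "France", "Melbourne", "France"],
    "type": ["home", "", "", "home", ""],
    "start": [1, None, None, 1, None],
    "company": ["apple", "tesla", "apple", "apple", "apple"],
})

# Row-above rule: compare each row only with the row directly above it.
prev_same_company = df["company"].shift() == df["company"]
prev_is_home = df["type"].shift() == "home"
df["source"] = np.where(prev_same_company & prev_is_home, "visit", df["source"])

print(df["source"].tolist())
```

Only row 4 is labelled here. Row 2 is skipped because the row directly above it belongs to tesla, even though the apple home row at 10:00 is only 4 minutes earlier.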

10:04 is within 5 minutes of 10:00, so row 2 should be a visit. It also meets both conditions for a visit. Expected output:

source datetime location  type  start  company 
0          10:00    london    home  1       apple
1          10:03    unknown                 tesla
2    visit 10:04    France                  apple
3          10:05    Melbourne home  1       apple
4    visit 10:06    France                  apple

Here is one way to do it:

import numpy as np
import pandas as pd

# Reference time: the datetime of every row whose type is non-empty (the 'home' rows)
df['ref_date'] = df.loc[df['type'].str.strip() != '', 'datetime']
# Forward-fill the reference time within each company;
# rows with no earlier home row for their company stay NaN
df['ref_date'] = df.groupby('company')['ref_date'].ffill()
# Minutes elapsed since the most recent home row of the same company (NaN if there is none)
elapsed = (pd.to_datetime(df['datetime']) - pd.to_datetime(df['ref_date'])).dt.total_seconds() / 60
# Set source to 'visit' where datetime differs from ref_date (i.e. not the home row itself)
# and the time difference is 5 minutes or less
df['source'] = np.where((df['datetime'] != df['ref_date']) & (elapsed <= 5),
                        'visit', df['source'])
df = df.drop(columns='ref_date')
df
source  datetime    location    type    start   company
0              10:00    london      home    1.0     apple
1              10:03    unknown                     tesla
2   visit      10:04    France                      apple
3              10:05    Melbourne   home    1.0     apple
4   visit      10:06    France                      apple
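
As an alternative sketch, not part of the answer above, the same "most recent home row of the same company within 5 minutes" lookup can probably be expressed with `pd.merge_asof`, using its `by` and `tolerance` parameters. The sample frame is rebuilt by hand, and the `ts` and `home_ts` columns, the `homes` frame, and the `%H:%M` time format are assumptions made for illustration.

```python
import pandas as pd

# Sample frame rebuilt from the question; "" stands in for blank cells (assumption).
df = pd.DataFrame({
    "source": ["", "", "", "", ""],
    "datetime": ["10:00", "10:03", "10:04", "10:05", "10:06"],
    "location": ["london", "unknown", "France", "Melbourne", "France"],
    "type": ["home", "", "", "home", ""],
    "start": [1, None, None, 1, None],
    "company": ["apple", "tesla", "apple", "apple", "apple"],
})

# Parse the clock times; the "%H:%M" format is an assumption about the data.
df["ts"] = pd.to_datetime(df["datetime"], format="%H:%M")

# Home rows only, with their timestamp copied into a separate column.
homes = df.loc[df["type"] == "home", ["ts", "company"]].assign(
    home_ts=lambda d: d["ts"]
)

# For each row, look backward for the latest home row of the same company
# that is no more than 5 minutes earlier. Both frames must be sorted by "ts".
matched = pd.merge_asof(
    df, homes, on="ts", by="company", tolerance=pd.Timedelta(minutes=5)
)

# A row is a visit if it found a home row and is not a home row itself.
is_visit = matched["home_ts"].notna().to_numpy() & (df["type"] != "home").to_numpy()
df.loc[is_visit, "source"] = "visit"
df = df.drop(columns="ts")

print(df["source"].tolist())
```

The explicit `type != "home"` filter is there because, by default, `merge_asof` lets a home row match itself. `merge_asof` also requires both frames to be sorted on the `on` key, which holds here because the question states the dataframe is sorted.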
