Suppose I have a DataFrame containing user events:
+---------+------------------+---------------------+
| user_id | event_name       | timestamp           |
+---------+------------------+---------------------+
| 1       | HomeAppear       | 2020-12-13 06:38:14 |
+---------+------------------+---------------------+
| 1       | TariffsAppear    | 2020-12-13 06:40:13 |
+---------+------------------+---------------------+
| 1       | CheckoutPayClick | 2020-12-13 06:50:12 |
+---------+------------------+---------------------+
| 2       | HomeAppear       | 2020-12-13 11:38:33 |
+---------+------------------+---------------------+
| 2       | TariffsAppear    | 2020-12-13 11:39:18 |
+---------+------------------+---------------------+
For each user, I want to add a new row after their last event (by timestamp): an 'End' event with the same timestamp as that last event, for example:
+---------+------------------+---------------------+
| 1       | End              | 2020-12-13 06:50:12 |
+---------+------------------+---------------------+
I don't know how to do this. In SQL I would use LAG() or LEAD() for it. But what about pandas?
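Get the last row per user_id with DataFrame.drop_duplicates, change event_name to End, then concat the result with the original and sort by index (using the stable sort kind mergesort, so each End row stays after the row it was copied from):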
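import pandas as pd

# sort first if the rows are not already ordered by user and time
df = df.sort_values(['user_id', 'timestamp'], ignore_index=True)
# last row per user, relabelled as 'End'
df2 = df.drop_duplicates('user_id', keep='last').assign(event_name='End')
# stable sort on the index keeps each 'End' row right after its source row
df = pd.concat([df, df2]).sort_index(kind='mergesort').reset_index(drop=True)
print(df)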
   user_id        event_name            timestamp
0        1        HomeAppear  2020-12-13 06:38:14
1        1     TariffsAppear  2020-12-13 06:40:13
2        1  CheckoutPayClick  2020-12-13 06:50:12
3        1               End  2020-12-13 06:50:12
4        2        HomeAppear  2020-12-13 11:38:33
5        2     TariffsAppear  2020-12-13 11:39:18
6        2               End  2020-12-13 11:39:18
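
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same drop_duplicates + concat idea. The sample data is copied from the question; the function name add_end_events is only a hypothetical wrapper for illustration, and parsing the timestamps with pd.to_datetime is an assumption (the question does not say what dtype the column has).

```python
import pandas as pd

# Sample data copied from the question; parsing to datetimes is an assumption.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_name": ["HomeAppear", "TariffsAppear", "CheckoutPayClick",
                   "HomeAppear", "TariffsAppear"],
    "timestamp": pd.to_datetime([
        "2020-12-13 06:38:14", "2020-12-13 06:40:13", "2020-12-13 06:50:12",
        "2020-12-13 11:38:33", "2020-12-13 11:39:18",
    ]),
})


def add_end_events(events: pd.DataFrame) -> pd.DataFrame:
    """Append an 'End' row after each user's last event (hypothetical helper)."""
    # Order by user and time, and reset the index to 0..n-1.
    events = events.sort_values(["user_id", "timestamp"], ignore_index=True)
    # Last row per user, relabelled as 'End'; it keeps the index of its source row.
    last = events.drop_duplicates("user_id", keep="last").assign(event_name="End")
    # A stable sort on the index places each 'End' row right after its source row.
    return (pd.concat([events, last])
              .sort_index(kind="mergesort")
              .reset_index(drop=True))


result = add_end_events(df)
print(result)
```

Wrapping the steps in a function leaves the original df untouched, which can be handy if the raw events are needed again later.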
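You can also do it like this: append one End row per user with an empty timestamp, then forward-fill it from the previous row: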
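import numpy as np
import pandas as pd
df = df.sort_values(['user_id', 'timestamp'])
df1 = pd.DataFrame({'user_id': np.unique(df['user_id']), 'event_name': 'End', 'timestamp': np.nan})
# a stable sort is needed so each 'End' row lands after that user's existing rows
df = pd.concat([df, df1], axis=0).sort_values(by='user_id', kind='mergesort', ignore_index=True)
# copy the previous row's timestamp into the 'End' row
df['timestamp'] = df['timestamp'].ffill()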
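
Since the question mentions LAG() and LEAD(): the closest pandas counterpart is probably shift() applied per group. The following is only a sketch of that idea, not something taken from the answers above. It assumes the sample data from the question, timestamps parsed as datetimes, and no missing timestamps in the input (a missing value would be mistaken for "no next event").

```python
import pandas as pd

# Sample data copied from the question; parsing to datetimes is an assumption.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_name": ["HomeAppear", "TariffsAppear", "CheckoutPayClick",
                   "HomeAppear", "TariffsAppear"],
    "timestamp": pd.to_datetime([
        "2020-12-13 06:38:14", "2020-12-13 06:40:13", "2020-12-13 06:50:12",
        "2020-12-13 11:38:33", "2020-12-13 11:39:18",
    ]),
})
df = df.sort_values(["user_id", "timestamp"], ignore_index=True)

# Rough analogue of LEAD(timestamp) OVER (PARTITION BY user_id ORDER BY timestamp):
# each row gets the timestamp of the same user's next event, or NaT if none.
next_ts = df.groupby("user_id")["timestamp"].shift(-1)

# Rows without a next event are each user's last event; copy them as 'End'.
end_rows = df[next_ts.isna()].assign(event_name="End")

# A stable sort on the index places each 'End' row right after its source row.
out = (pd.concat([df, end_rows])
         .sort_index(kind="mergesort")
         .reset_index(drop=True))
print(out)
```

shift(1) would play the role of LAG() in the same way.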