I have an application log that I'm pulling into a pandas DataFrame, parsing the various parts of each event into separate columns, similar to this:
import pandas as pd

data1 = {'timestamp': ['01-01-2021 12:00:00','01-01-2021 12:01:00','01-01-2021 12:02:00','01-01-2021 12:03:00','01-01-2021 12:04:00','01-01-2021 12:05:00','01-01-2021 12:06:00','01-01-2021 12:07:00','01-01-2021 12:08:00','01-01-2021 12:09:00','01-01-2021 12:10:00','01-01-2021 12:11:00','01-01-2021 12:12:00','01-01-2021 12:13:00','01-01-2021 12:14:00'],
'event': ['start','x','y','start','z','end','x','end','start','x','end','start','q','end','start'],
'account': ['bob','bob','bob','jane','bob','bob','jane','jane','todd','todd','todd','bob','bob','bob','ned'] }
df1 = pd.DataFrame(data1)
print(df1)
timestamp event account
0 01-01-2021 12:00:00 start bob
1 01-01-2021 12:01:00 x bob
2 01-01-2021 12:02:00 y bob
3 01-01-2021 12:03:00 start jane
4 01-01-2021 12:04:00 z bob
5 01-01-2021 12:05:00 end bob
6 01-01-2021 12:06:00 x jane
7 01-01-2021 12:07:00 end jane
8 01-01-2021 12:08:00 start todd
9 01-01-2021 12:09:00 x todd
10 01-01-2021 12:10:00 end todd
11 01-01-2021 12:11:00 start bob
12 01-01-2021 12:12:00 q bob
13 01-01-2021 12:13:00 end bob
14 01-01-2021 12:14:00 start ned
A very straightforward log. The timestamps are strings, but they can easily be converted to datetime objects if needed. The log lists various actions users perform, but what I'm interested in is producing a report of user account sessions, like this:
account start end
0 bob 01-01-2021 12:00:00 01-01-2021 12:05:00
1 jane 01-01-2021 12:03:00 01-01-2021 12:07:00
2 todd 01-01-2021 12:08:00 01-01-2021 12:10:00
3 bob 01-01-2021 12:11:00 01-01-2021 12:13:00
4 ned 01-01-2021 12:14:00 NaN
Grouping by start and end times is easy; the problem I can't figure out is how to do this when a user has multiple sessions within the time period I'm looking at. In the pseudo log data above, Bob opened 2 sessions, but in theory he could open 100 sessions, depending on the date range of the data I'm looking at. The application in question limits users to 1 session at a time, so I should never see 2 sessions open simultaneously for the same account.
How can I group the session start/end times into 2 columns like this when an account can have multiple sessions in the data?
Try:
(df1.assign(idx=df1.event.eq('start').groupby(df1.account).cumsum()) # enumerate the `start` event by account
.loc[lambda x: x['event'].isin(['start','end'])] # only keep `start` and `end` events
.set_index(['idx','account','event']) # set index and unstack
['timestamp'].unstack() # then unstack
.reset_index(level=1).reset_index(drop=True) # tidying up
)
Output:
event account end start
0 bob 01-01-2021 12:05:00 01-01-2021 12:00:00
1 jane 01-01-2021 12:07:00 01-01-2021 12:03:00
2 ned NaN 01-01-2021 12:14:00
3 todd 01-01-2021 12:10:00 01-01-2021 12:08:00
4 bob 01-01-2021 12:13:00 01-01-2021 12:11:00
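Note that the unstacked result comes back grouped by account rather than in session order, and `end` sorts before `start` alphabetically. A minimal follow-up sketch (re-using the same pipeline on the question's sample data) that parses the timestamp strings and sorts by session start to match the report shape asked for:

```python
import pandas as pd

# Rebuild the sample log from the question
data1 = {'timestamp': [f'01-01-2021 12:{m:02d}:00' for m in range(15)],
         'event': ['start', 'x', 'y', 'start', 'z', 'end', 'x', 'end',
                   'start', 'x', 'end', 'start', 'q', 'end', 'start'],
         'account': ['bob', 'bob', 'bob', 'jane', 'bob', 'bob', 'jane', 'jane',
                     'todd', 'todd', 'todd', 'bob', 'bob', 'bob', 'ned']}
df1 = pd.DataFrame(data1)

# Same pipeline as above: number each session per account, keep only the
# boundary events, and unstack start/end into columns
sessions = (df1.assign(idx=df1.event.eq('start').groupby(df1.account).cumsum())
               .loc[lambda x: x['event'].isin(['start', 'end'])]
               .set_index(['idx', 'account', 'event'])['timestamp']
               .unstack()
               .reset_index(level=1)
               .reset_index(drop=True))

# Parse the timestamp strings and order the report chronologically;
# ned's missing `end` becomes NaT
sessions[['start', 'end']] = sessions[['start', 'end']].apply(
    pd.to_datetime, format='%m-%d-%Y %H:%M:%S')
sessions = (sessions.sort_values('start')
                    [['account', 'start', 'end']]
                    .reset_index(drop=True))
print(sessions)
```

Sorting by `start` is safe here because the app guarantees one session per account at a time, so session starts are a strict ordering within each account.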
You can identify the starts and use pivot_table:
(df1.assign(idx=df1['event'].eq('start').groupby(df1['account']).cumsum())
.pivot_table(index=['account', 'idx'], columns='event', values='timestamp', aggfunc='first')
[['start', 'end']]
)
Output:
event start end
account idx
bob 1 01-01-2021 12:00:00 01-01-2021 12:05:00
2 01-01-2021 12:11:00 01-01-2021 12:13:00
jane 1 01-01-2021 12:03:00 01-01-2021 12:07:00
ned 1 01-01-2021 12:14:00 NaN
todd 1 01-01-2021 12:08:00 01-01-2021 12:10:00
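If you want that MultiIndexed result flattened back into the three-column report from the question, a small follow-up sketch (same pivot as above, on the question's sample data) drops the helper `idx` level and resets the index:

```python
import pandas as pd

# Rebuild the sample log from the question
data1 = {'timestamp': [f'01-01-2021 12:{m:02d}:00' for m in range(15)],
         'event': ['start', 'x', 'y', 'start', 'z', 'end', 'x', 'end',
                   'start', 'x', 'end', 'start', 'q', 'end', 'start'],
         'account': ['bob', 'bob', 'bob', 'jane', 'bob', 'bob', 'jane', 'jane',
                     'todd', 'todd', 'todd', 'bob', 'bob', 'bob', 'ned']}
df1 = pd.DataFrame(data1)

report = (df1.assign(idx=df1['event'].eq('start').groupby(df1['account']).cumsum())
             .pivot_table(index=['account', 'idx'], columns='event',
                          values='timestamp', aggfunc='first')
             [['start', 'end']]
             .reset_index(level='idx', drop=True)  # per-account counter has done its job
             .reset_index()                        # account back to a regular column
             .rename_axis(columns=None))           # drop the leftover 'event' axis name
print(report)
```

The rows come out grouped by account (pivot_table sorts the index); sort by `start` afterwards if you need the report in chronological order.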