Group DataFrame rows by start/end timestamps



I have an application log that I'm loading into a pandas DataFrame, parsing the parts of each event into separate columns. It looks something like this:

import pandas as pd

data1 = {'timestamp': ['01-01-2021 12:00:00','01-01-2021 12:01:00','01-01-2021 12:02:00','01-01-2021 12:03:00','01-01-2021 12:04:00','01-01-2021 12:05:00','01-01-2021 12:06:00','01-01-2021 12:07:00','01-01-2021 12:08:00','01-01-2021 12:09:00','01-01-2021 12:10:00','01-01-2021 12:11:00','01-01-2021 12:12:00','01-01-2021 12:13:00','01-01-2021 12:14:00'],
         'event':     ['start','x','y','start','z','end','x','end','start','x','end','start','q','end','start'],
         'account':   ['bob','bob','bob','jane','bob','bob','jane','jane','todd','todd','todd','bob','bob','bob','ned'] }
df1 = pd.DataFrame(data1)
print(df1)
              timestamp  event account
0   01-01-2021 12:00:00  start     bob
1   01-01-2021 12:01:00      x     bob
2   01-01-2021 12:02:00      y     bob
3   01-01-2021 12:03:00  start    jane
4   01-01-2021 12:04:00      z     bob
5   01-01-2021 12:05:00    end     bob
6   01-01-2021 12:06:00      x    jane
7   01-01-2021 12:07:00    end    jane
8   01-01-2021 12:08:00  start    todd
9   01-01-2021 12:09:00      x    todd
10  01-01-2021 12:10:00    end    todd
11  01-01-2021 12:11:00  start     bob
12  01-01-2021 12:12:00      q     bob
13  01-01-2021 12:13:00    end     bob
14  01-01-2021 12:14:00  start     ned

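A pretty straightforward log. The timestamps are strings, but they can easily be converted to datetime objects if needed. The log lists the various actions users perform, but what I'm interested in is generating a report of user account sessions, like this: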

     account  start                end
0    bob      01-01-2021 12:00:00  01-01-2021 12:05:00
1    jane     01-01-2021 12:03:00  01-01-2021 12:07:00
2    todd     01-01-2021 12:08:00  01-01-2021 12:10:00
3    bob      01-01-2021 12:11:00  01-01-2021 12:13:00
4    ned      01-01-2021 12:14:00                  NaN

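Grouping by start and end time is easy. What I can't figure out is how to do this when a user has multiple sessions within the time period I'm looking at. In the mock log data above, Bob opened 2 sessions, but in theory he could open 100, depending on the date range of the data I'm looking at. The application in question limits each user to 1 session at a time, so I should never see the same account with 2 sessions open simultaneously.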

How can I group the session start/end times into 2 columns like this when an account can have multiple sessions in the data?
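The question notes that the timestamp strings can be converted to datetime objects if needed. A minimal sketch of that conversion follows; it assumes the strings are month-first (`MM-DD-YYYY`), which the sample data does not make explicit, and uses a small two-row frame purely for illustration.

```python
import pandas as pd

# Two sample rows in the same 'MM-DD-YYYY HH:MM:SS' layout as the log.
# Assumption: the month comes first; use '%d-%m-%Y %H:%M:%S' for day-first logs.
df = pd.DataFrame({'timestamp': ['01-01-2021 12:00:00', '01-01-2021 12:05:00']})

# Passing an explicit format avoids ambiguous day/month inference.
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%m-%d-%Y %H:%M:%S')

# Once converted, timestamps sort chronologically and support arithmetic.
elapsed = df['timestamp'].iloc[1] - df['timestamp'].iloc[0]
print(df['timestamp'].dtype)
print(elapsed)
```

Converting before grouping is optional for the approaches in this thread, but it makes sorting and duration calculations reliable across days and years.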

Try:

(df1.assign(idx=df1.event.eq('start').groupby(df1.account).cumsum())  # number each account's sessions by counting its `start` events
    .loc[lambda x: x['event'].isin(['start','end'])]                  # only keep `start` and `end` events
    .set_index(['idx','account','event'])                             # index by session number, account and event
    ['timestamp'].unstack()                                           # pivot `event` into `start`/`end` columns
    .reset_index(level=1).reset_index(drop=True)                      # tidy up: keep `account` as a column, drop `idx`
)

Output:

event account                  end                start
0         bob  01-01-2021 12:05:00  01-01-2021 12:00:00
1        jane  01-01-2021 12:07:00  01-01-2021 12:03:00
2         ned                  NaN  01-01-2021 12:14:00
3        todd  01-01-2021 12:10:00  01-01-2021 12:08:00
4         bob  01-01-2021 12:13:00  01-01-2021 12:11:00
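The output above has the columns as `end`, `start` and the rows ordered by session number rather than by time. To get closer to the layout requested in the question, the same chain can be followed by a column reorder and a sort. This is only a sketch: the name `sessions` is illustrative, the sample data is rebuilt compactly, and the timestamps are converted to datetimes first (assuming a month-first format) so that the sort is chronological.

```python
import pandas as pd

# Rebuild the sample log from the question (one event per minute, 12:00 to 12:14).
events = ['start', 'x', 'y', 'start', 'z', 'end', 'x', 'end',
          'start', 'x', 'end', 'start', 'q', 'end', 'start']
accounts = ['bob', 'bob', 'bob', 'jane', 'bob', 'bob', 'jane', 'jane',
            'todd', 'todd', 'todd', 'bob', 'bob', 'bob', 'ned']
timestamps = [f'01-01-2021 12:{m:02d}:00' for m in range(15)]
df1 = pd.DataFrame({'timestamp': timestamps, 'event': events, 'account': accounts})

# Assumption: month-first strings; convert so sorting is chronological.
df1['timestamp'] = pd.to_datetime(df1['timestamp'], format='%m-%d-%Y %H:%M:%S')

# Session number per account: running count of that account's `start` events.
idx = df1['event'].eq('start').groupby(df1['account']).cumsum()

sessions = (
    df1.assign(idx=idx)
       .loc[lambda x: x['event'].isin(['start', 'end'])]
       .set_index(['idx', 'account', 'event'])['timestamp']
       .unstack()                              # `event` values become columns
       .reset_index(level='account')           # keep account as a regular column
       .reset_index(drop=True)                 # drop the session counter
       .rename_axis(columns=None)              # remove the leftover 'event' label
       [['account', 'start', 'end']]           # column order from the question
       .sort_values('start', ignore_index=True)
)
print(sessions)
```

A session with no `end` event yet (ned in the sample) keeps a missing value in the `end` column.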

You can identify the start events and use pivot_table:

(df1.assign(idx=df1['event'].eq('start').groupby(df1['account']).cumsum())
    .pivot_table(index=['account', 'idx'], columns='event', values='timestamp', aggfunc='first')
    [['start', 'end']]
)

Output:

event                      start                  end
account idx                                          
bob     1    01-01-2021 12:00:00  01-01-2021 12:05:00
        2    01-01-2021 12:11:00  01-01-2021 12:13:00
jane    1    01-01-2021 12:03:00  01-01-2021 12:07:00
ned     1    01-01-2021 12:14:00                  NaN
todd    1    01-01-2021 12:08:00  01-01-2021 12:10:00
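The pivot_table result keeps `account` and `idx` in a MultiIndex. If a flat report is preferred, it can be flattened and extended with a session length. The sketch below is one possible way to do that: the names `report` and `duration` are illustrative and not part of the original answer, and the datetime conversion assumes month-first strings.

```python
import pandas as pd

# Rebuild the sample log from the question (one event per minute, 12:00 to 12:14).
events = ['start', 'x', 'y', 'start', 'z', 'end', 'x', 'end',
          'start', 'x', 'end', 'start', 'q', 'end', 'start']
accounts = ['bob', 'bob', 'bob', 'jane', 'bob', 'bob', 'jane', 'jane',
            'todd', 'todd', 'todd', 'bob', 'bob', 'bob', 'ned']
timestamps = [f'01-01-2021 12:{m:02d}:00' for m in range(15)]
df1 = pd.DataFrame({'timestamp': timestamps, 'event': events, 'account': accounts})

report = (
    df1.assign(idx=df1['event'].eq('start').groupby(df1['account']).cumsum())
       .pivot_table(index=['account', 'idx'], columns='event',
                    values='timestamp', aggfunc='first')
       [['start', 'end']]
       .rename_axis(columns=None)   # remove the 'event' label on the columns
       .reset_index()               # turn account/idx back into columns
       .drop(columns='idx')         # the session counter is no longer needed
)

# Assumption: month-first strings. Missing `end` values become NaT.
for col in ['start', 'end']:
    report[col] = pd.to_datetime(report[col], format='%m-%d-%Y %H:%M:%S')

# Session length; open sessions (no `end` yet) get a missing duration.
report['duration'] = report['end'] - report['start']
print(report)
```

Rows stay in the pivot's order (by account, then session number); add `.sort_values('start')` if chronological order is wanted.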
