Best way to handle a nested groupby in Python

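I am currently trying to process some log files using Python and the pandas library. The logs contain simple information about requests sent to a server, and I want to extract information about sessions from them. A session here is defined as the set of requests made by the same user within a specific time period (for example 30 minutes, measured from the first request to the last request; requests after this period should be treated as part of a new session).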

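To do this, I currently perform a nested grouping: first I use groupby to get the requests of each user, then I group each user's requests into 30-minute intervals, and finally I iterate over those intervals and keep the ones that actually contain data: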

# example log entry:
# id,host,time,method,url,response,bytes
# 303372,XXX.XXX.XXX.XXX,1995-07-11 12:17:09,GET,/htbin/wais.com?IMAX,200,6923
by_host = logs.groupby('host', sort=False)
for host, frame in by_host:
    # 30-minute windows counted from this host's first request
    by_frame = frame.groupby(pd.Grouper(key='time', freq='30min', origin='start'))
    for date, session in by_frame:
        # skip empty windows and windows that hold a single request
        if session.shape[0] > 1:
            session_calculations()

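This is of course quite inefficient, and the computation takes a considerable amount of time. Is there any way to optimize this process? I have not been able to come up with anything that works.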

Edit:

   host         time                 method  url                                           response  bytes
0  ***.novo.dk  1995-07-11 12:17:09  GET     /ksc.html                                     200       7067
1  ***.novo.dk  1995-07-11 12:17:48  GET     /shuttle/missions/missions.html               200       8678
2  ***.novo.dk  1995-07-11 12:23:10  GET     /shuttle/resources/orbiters/columbia.html     200       6922
3  ***.novo.dk  1995-08-09 12:48:48  GET     /shuttle/missions/sts-69/mission-sts-69.html  200       11264
4  ***.novo.dk  1995-08-09 12:49:48  GET     /shuttle/countdown/liftoff.html               200       4665

The expected result is a list of sessions extracted from the requests:

   host         session_time
0  ***.novo.dk  00:06:01
1  ***.novo.dk  00:01:00

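Note that session_time here is the time difference between the first and the last request from the input, after grouping the requests into 30-minute time windows.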

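To define time windows that are local to each user, that is, with the origin taken as the time of each user's first request, you can first group by 'host'. Then use GroupBy.apply to apply a function to each user's DataFrame; that function handles the time grouping and computes the duration of each of the user's sessions.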

def session_duration_by_host(by_host):
    time_grouper = pd.Grouper(key='time', freq='30min', origin='start')
    duration = lambda time: time.max() - time.min()
    return (
        by_host.groupby(time_grouper)
               .agg(session_time=('time', duration))
    )

res = (
    logs.groupby("host")
        .apply(session_duration_by_host)
        .reset_index()
        .drop(columns="time")
)
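
As a possible further optimization, the per-host windows can be computed without any nested groupby. The sketch below is not the approach above but a swapped-in technique: it floor-divides each request's offset from its host's first request by 30 minutes, which should yield the same bins as `origin='start'` while never materializing empty windows. The names `first_seen`, `window`, `sessions`, and `result` are illustrative, and the sketch assumes the `time` column is already a datetime column.

```python
import pandas as pd

# Sample rows taken from the question's edit (only the columns needed here).
logs = pd.DataFrame({
    "host": ["***.novo.dk"] * 5,
    "time": pd.to_datetime([
        "1995-07-11 12:17:09",
        "1995-07-11 12:17:48",
        "1995-07-11 12:23:10",
        "1995-08-09 12:48:48",
        "1995-08-09 12:49:48",
    ]),
})

# Time of the first request of each host, broadcast back to every row.
first_seen = logs.groupby("host")["time"].transform("min")

# Index of the 30-minute window each request falls into (0, 1, 2, ...).
window = ((logs["time"] - first_seen) // pd.Timedelta(minutes=30)).rename("window")

# One row per (host, window) pair that actually contains requests.
sessions = (
    logs.groupby(["host", window], sort=False)["time"]
        .agg(start="min", end="max", requests="count")
        .reset_index()
)
sessions["session_time"] = sessions["end"] - sessions["start"]

# Keep only windows with more than one request, as in the question's loop.
result = sessions.loc[sessions["requests"] > 1, ["host", "session_time"]].reset_index(drop=True)
print(result)
```

Because only non-empty (host, window) pairs are grouped, hosts whose requests are weeks apart do not produce long runs of empty bins that then have to be filtered out.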

Idiomatic pandas code chains operations: rather than processing something, saving it into a variable, and using that variable only once for the next step, chain the whole process. GroupBy.apply is also often more concise than an explicit for loop, although a Python function passed to apply still runs once per group, so a speedup is not guaranteed.
logs.groupby('host', sort=False).apply(
    lambda by_frame: by_frame.groupby(
        pd.Grouper(key='time', freq='30min', origin='start')
    ).apply(
        lambda logs: session_calculations()
        if (not logs.empty) and (logs.shape[0] > 1)
        else None
    )
)
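
One caveat about fixed 30-minute bins: the question defines a session as starting at a user's first request, with any request after the 30 minutes opening a new session. Bins anchored at the very first request can split such a session when it straddles a bin boundary. The sketch below follows the question's definition literally with a plain Python loop per host. The helper `assign_sessions`, the host `a.example`, and the demo timestamps are hypothetical, and treating a request exactly 30 minutes after the session start as a new session is an assumption. It is meant as a correctness reference, not as an optimized implementation.

```python
import pandas as pd

SESSION_LENGTH = pd.Timedelta(minutes=30)  # window length from the question

def assign_sessions(times, length=SESSION_LENGTH):
    """Number the sessions in one host's time-sorted requests (hypothetical helper)."""
    ids = []
    session_id = -1
    session_start = None
    for t in times:
        # Open a new session when there is none yet or the current one has expired.
        if session_start is None or t - session_start >= length:
            session_id += 1
            session_start = t
        ids.append(session_id)
    return pd.Series(ids, index=times.index)

# Made-up requests chosen so that fixed bins and restarting windows disagree.
demo = pd.DataFrame({
    "host": ["a.example"] * 4,
    "time": pd.to_datetime([
        "1995-07-11 12:00:00",
        "1995-07-11 12:29:00",
        "1995-07-11 12:45:00",
        "1995-07-11 13:10:00",
    ]),
}).sort_values(["host", "time"])

# Label each request with its session number, host by host.
demo["session"] = pd.concat(
    [assign_sessions(times) for _, times in demo.groupby("host", sort=False)["time"]]
)

# Duration of each session: last request minus first request.
bounds = demo.groupby(["host", "session"])["time"].agg(["min", "max"])
session_time = bounds["max"] - bounds["min"]
print(session_time)
```

With fixed bins starting at 12:00, the requests at 12:45 and 13:10 would land in different windows; here they share one session because the second window restarts at 12:45.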
