using pandas Series.rolling with DateOffset

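Python, pandas, and data analysis question here.

What I am trying to do is identify the busiest 60-minute interval in a large set of Apache server logs. I have already extracted the timestamps from the logs into a list.

time_received is a list with values like these: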

[
1995-07-01T00:01:18-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:01:19-04:00,
1995-07-01T00:11:45-04:00,
1995-07-01T00:11:45-04:00,
1995-07-01T00:11:45-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:43-04:00,
1995-07-01T00:13:46-04:00,
1995-07-01T00:13:47-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:48-04:00,
1995-07-01T00:13:50-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:13:53-04:00,
1995-07-01T00:14:11-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:17-04:00,
1995-07-01T00:14:18-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:20-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:21-04:00,
1995-07-01T00:14:22-04:00,
1995-07-01T00:14:22-04:00,
1995-07-01T00:14:23-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:24-04:00,
1995-07-01T00:14:26-04:00,
1995-07-01T00:14:27-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:30-04:00,
1995-07-01T00:14:31-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:32-04:00,
1995-07-01T00:14:36-04:00,
]

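My goal is to move along this list of timestamps and count the entries in the 60-minute interval starting from any one of them. Once I have the rolling window working, I think I can handle the rest.

In the pandas documentation (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.rolling.html) I found the following entry for the window parameter: "window : int, or offset. Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size. If its an offset then this will be the time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes. This is new in 0.19.0"

I am using pandas 0.19.2. The option of a variable-sized window based on the observations within a time period sounds like exactly what I want, so I tried to implement it: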

import pandas as pd
from pandas.tseries.offsets import DateOffset

def busiest_timeframe(data, timeframe=60):
    time_window = DateOffset(minutes=60)
    print(type(time_window))
    series = pd.Series(data)
    series.rolling(time_window).count()
    return series

busiest_tf = busiest_timeframe(time_received)

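I get the following error:

raise ValueError("window must be an integer")

ValueError: window must be an integer

Am I using the wrong offset object? Does this pandas feature not work? Am I misunderstanding the documentation?

Thanks in advance for your help and suggestions!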

Try using an offset alias instead of DateOffset:

http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

Example from the documentation:

import pandas as pd
import numpy as np

df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
                  index=[pd.Timestamp('20130101 09:00:00'),
                         pd.Timestamp('20130101 09:00:02'),
                         pd.Timestamp('20130101 09:00:03'),
                         pd.Timestamp('20130101 09:00:05'),
                         pd.Timestamp('20130101 09:00:06')])
print(df.rolling('2s').count())

Output:

                       B
2013-01-01 09:00:00  1.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  2.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  1.0
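
Applied to the question's data, the same idea might look like the sketch below. It is only a sketch under some assumptions: the timestamps are moved into a sorted DatetimeIndex (offset windows only work on a datetime-like index), everything is normalized to UTC so mixed offsets cannot cause trouble, and each request is given the value 1 so that a rolling sum counts requests. The name busiest_window and the shortened sample list are made up for illustration. Note that an offset window looks backwards, so each count covers the 60 minutes ending at that timestamp.

```python
import pandas as pd

# Shortened, hypothetical sample; in the question this is the full time_received list.
time_received = [
    "1995-07-01T00:01:18-04:00",
    "1995-07-01T00:01:19-04:00",
    "1995-07-01T00:11:45-04:00",
    "1995-07-01T00:13:43-04:00",
    "1995-07-01T01:05:00-04:00",
]

def busiest_window(timestamps, window="60min"):
    """Return (window_start, window_end, request_count) for the busiest trailing window."""
    # Offset-based windows need a sorted datetime-like index.
    index = pd.to_datetime(timestamps, utc=True).sort_values()
    # One row per request with value 1, so a rolling sum is a request count.
    hits = pd.Series(1, index=index)
    counts = hits.rolling(window).sum()
    end = counts.idxmax()  # timestamp at which the busiest window ends
    return end - pd.Timedelta(window), end, int(counts.max())

start, end, count = busiest_window(time_received)
print(start, end, count)
```

The window is passed as the string alias "60min" rather than a DateOffset object, which is the change this answer suggests.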

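Sadly, I don't know how to use Series.rolling for this. It seems you did not set the timestamps as the index, which is why it doesn't work. Even after doing that I still got errors, so here is another option (maybe a really ugly one). If someone else has a better way, go with theirs.

So yes, it uses boolean indexing. Play around with the code (print statements help a lot); you can change >= and <= to > and < if needed.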

liste=[
"1995-07-01T00:01:18-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:01:19-04:00",
"1995-07-01T00:11:45-04:00",
"1995-07-01T00:11:45-04:00",
"1995-07-01T00:11:45-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:43-04:00",
"1995-07-01T00:13:46-04:00",
"1995-07-01T00:13:47-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:48-04:00",
"1995-07-01T00:13:50-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:13:53-04:00",
"1995-07-01T00:14:11-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:17-04:00",
"1995-07-01T00:14:18-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:20-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:21-04:00",
"1995-07-01T00:14:22-04:00",
"1995-07-01T00:14:22-04:00",
"1995-07-01T00:14:23-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:24-04:00",
"1995-07-01T00:14:26-04:00",
"1995-07-01T00:14:27-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:30-04:00",
"1995-07-01T00:14:31-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:32-04:00",
"1995-07-01T00:14:36-04:00"
]
import pandas as pd

def busiest_timeframe(data, timeframe=1):
    # Parse the ISO 8601 strings; the UTC offset is picked up automatically.
    series = pd.to_datetime(pd.Series(data))
    df = series.to_frame(name="time")
    window = pd.Timedelta(seconds=timeframe)  # change seconds to minutes or whatever you want
    # For each timestamp x, count the rows that fall inside [x, x + window].
    df["count"] = [((df["time"] >= x) & (df["time"] <= x + window)).sum()
                   for x in df["time"]]
    highest_index = df["count"].idxmax()
    start = df.loc[highest_index, "time"]  # .ix was removed from pandas; use .loc
    # Return every row inside the busiest window.
    return df[(df["time"] >= start) & (df["time"] <= start + window)]

print(busiest_timeframe(liste))
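
The boolean-indexing approach compares every timestamp against every other one, which may get slow on a large log. A possible alternative, sketched here under the assumption that the timestamps can be sorted and normalized to UTC, uses numpy.searchsorted to count how many entries fall inside [x, x + window] for each start x. The function name busiest_forward_window and the shortened sample list are made up for illustration and are not part of the code above.

```python
import numpy as np
import pandas as pd

# Shortened, hypothetical sample of the timestamp list.
sample = [
    "1995-07-01T00:01:18-04:00",
    "1995-07-01T00:01:19-04:00",
    "1995-07-01T00:11:45-04:00",
    "1995-07-01T00:13:43-04:00",
    "1995-07-01T01:05:00-04:00",
]

def busiest_forward_window(timestamps, minutes=60):
    """Return (window_start, request_count) for the busiest window [x, x + minutes]."""
    # Normalize to UTC, drop the timezone, and sort so searchsorted is valid.
    index = pd.to_datetime(timestamps, utc=True).tz_convert(None).sort_values()
    values = index.to_numpy()
    ends = values + np.timedelta64(minutes, "m")
    # Position just past the last entry that is <= x + window, for every start x.
    right = np.searchsorted(values, ends, side="right")
    # Entries from position i up to right[i] lie inside the window starting at values[i].
    counts = right - np.arange(len(values))
    best = int(counts.argmax())
    return pd.Timestamp(values[best], tz="UTC"), int(counts[best])

start, count = busiest_forward_window(sample)
print(start, count)
```

Because the data is sorted, each window is found with a binary search instead of a full scan, while the inclusive >= and <= bounds of the code above are kept.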
