Adding a tolerance to gap detection in pandas



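Edit 1: printed the output of the sample code

I'm using pandas to find gaps in a set of syslog files. The example below gives me a True/False value that tells me whether consecutive real_date values differ by more than the 120 seconds I want to tolerate (lines 4029 and 4030).

I've found that a large number of syslog entries can have slight inaccuracies in their timestamps (lines 4027 and 4028). My guess is that because line 4028 has an earlier value than line 4027, the comparison is, strictly speaking, right to flag it.

How can I make it tolerate a difference of 1 second?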

import io

import pandas as pd

maximum_tolerated_gap_in_seconds = 120
gaps_to_ignore = 1
seconds_per_day = 86400
test_data = """"line_number","line_length","syslog_date","unix_timestamp","real_date"
4026,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:00:01"
4027,97,"Jul 19 01:00:02",1626656402.0,"2021-07-19T01:00:02"
4028,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:00:01"
4029,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:00:01"
4030,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:30:00"
"""
# Load the sample data before touching any columns.
test_data_file = io.StringIO(test_data)
df = pd.read_csv(test_data_file)
column_name = "real_date"
# Parse the timestamps; values that cannot be parsed become NaT and are dropped.
df[column_name] = pd.to_datetime(df[column_name], errors='coerce')
df.dropna(inplace=True, subset=[column_name])
# Seconds component of the difference between consecutive rows.
df['time_difference'] = df[column_name].diff().dt.seconds
df['test_gap'] = (df[column_name].diff().dt.seconds) <= maximum_tolerated_gap_in_seconds
df.head()

Output:

   line_number  line_length      syslog_date  unix_timestamp           real_date  time_difference  test_gap
0         4026          110  Jul 19 01:00:01    1.626656e+09 2021-07-19 01:00:01              NaN     False
1         4027           97  Jul 19 01:00:02    1.626656e+09 2021-07-19 01:00:02              1.0      True
2         4028          110  Jul 19 01:00:01    1.626656e+09 2021-07-19 01:00:01          86399.0     False
3         4029          110  Jul 19 01:00:01    1.626656e+09 2021-07-19 01:00:01              0.0      True
4         4030          110  Jul 19 01:00:01    1.626656e+09 2021-07-19 01:30:00           1799.0     False

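So when the timestamp on line 4027 is 1 second ahead of the next line, 4028, the computed difference is 86399.0, which is greater than maximum_tolerated_gap_in_seconds.

I guess my question is: how do I ignore the difference when it is really only 1 second, but shows up as 86399 seconds because the timestamps are out of order?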

Edit 2:

Here is the code updated with adr's suggestion to switch to dt.total_seconds():

import pandas as pd
import io
maximum_tolerated_gap_in_seconds = 120
gaps_to_ignore = 1
seconds_per_day = 86400
test_data = """"line_number","line_length","syslog_date","unix_timestamp","real_date"
4026,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:00:01"
4027,97,"Jul 19 01:00:02",1626656402.0,"2021-07-19T01:00:02"
4028,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:00:01"
4029,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:00:01"
4030,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:30:00"
"""
test_data_file = io.StringIO(test_data)
df = pd.read_csv(test_data_file)
column_name = "real_date"
df[column_name] = pd.to_datetime(df[column_name], errors='coerce')
df.dropna(inplace=True, subset=[column_name])
df['time_difference'] = df[column_name].diff().dt.total_seconds()
df['test_gap'] = (df[column_name].diff().dt.total_seconds()) <= maximum_tolerated_gap_in_seconds
print(df.head())

And the output:

$ python sample.py
line_number  line_length      syslog_date  unix_timestamp           real_date  time_difference  test_gap
0         4026          110  Jul 19 01:00:01    1.626656e+09 2021-07-19 01:00:01              NaN     False
1         4027           97  Jul 19 01:00:02    1.626656e+09 2021-07-19 01:00:02              1.0      True
2         4028          110  Jul 19 01:00:01    1.626656e+09 2021-07-19 01:00:01             -1.0      True
3         4029          110  Jul 19 01:00:01    1.626656e+09 2021-07-19 01:00:01              0.0      True
4         4030          110  Jul 19 01:00:01    1.626656e+09 2021-07-19 01:30:00           1799.0     False

That works nicely.
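Once the difference is signed, the tolerance for out-of-order entries can be made explicit instead of accepting every negative value. The following is only a sketch: it assumes that gaps_to_ignore is meant as the number of seconds a timestamp may step backwards, which the post does not state, and within_tolerance is a name chosen for illustration.

```python
import pandas as pd

maximum_tolerated_gap_in_seconds = 120
gaps_to_ignore = 1  # assumption: allowed backward step, in seconds

# The real_date values from the sample data.
real_date = pd.Series(pd.to_datetime([
    "2021-07-19T01:00:01",
    "2021-07-19T01:00:02",
    "2021-07-19T01:00:01",
    "2021-07-19T01:00:01",
    "2021-07-19T01:30:00",
]))

# Signed difference to the previous row, in seconds (NaN for the first row).
diff_seconds = real_date.diff().dt.total_seconds()

# True when the step lies inside [-gaps_to_ignore, maximum_tolerated_gap_in_seconds].
# NaN compares as False, so the first row is reported as False.
within_tolerance = diff_seconds.between(-gaps_to_ignore, maximum_tolerated_gap_in_seconds)

print(within_tolerance.tolist())
# [False, True, True, True, False]
```

With this band, a timestamp that jumps back by more than gaps_to_ignore seconds would be reported as False rather than silently passing the `<=` check.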

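The dt.seconds accessor always gives you a non-negative value. To make that possible, the days component goes negative instead: a difference of -1 second is stored as -1 days plus 86399 seconds. Try dt.total_seconds() instead.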
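A small check of how pandas splits a negative Timedelta into components may help here. The two timestamps are taken from lines 4027 and 4028 of the sample data; the variable names are only illustrative.

```python
import pandas as pd

# A difference of minus one second, as produced by an out-of-order timestamp.
delta = pd.Timedelta(seconds=-1)

# Only the days component carries the sign; seconds stays within 0..86399.
print(delta.days)             # -1
print(delta.seconds)          # 86399
print(delta.total_seconds())  # -1.0

# The same split shows up through the .dt accessor on a Series of differences.
stamps = pd.Series(pd.to_datetime(["2021-07-19T01:00:02", "2021-07-19T01:00:01"]))
diffs = stamps.diff()
print(int(diffs.dt.seconds.iloc[1]))     # 86399
print(diffs.dt.total_seconds().iloc[1])  # -1.0
```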
