Edit 1: Printed the output of the sample code.
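I am using pandas to find gaps in a set of syslog files. The example below gives me a True/False value depending on whether consecutive real_date values differ by more than the 120 seconds I am willing to tolerate (lines 4029 and 4030).
I have found that many syslog entries can have slightly inaccurate timestamps (lines 4027 and 4028). I assume the comparison is technically correct to flag this, since line 4028 holds an earlier value than line 4027.
How can I make it tolerate a 1-second difference?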
import pandas as pd
import io
maximum_tolerated_gap_in_seconds = 120
gaps_to_ignore = 1
seconds_per_day = 86400
test_data = """"line_number","line_length","syslog_date","unix_timestamp","real_date"
4026,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:00:01"
4027,97,"Jul 19 01:00:02",1626656402.0,"2021-07-19T01:00:02"
4028,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:00:01"
4029,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:00:01"
4030,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:30:00"
"""
# Load the sample data before touching any columns
test_data_file = io.StringIO(test_data)
df = pd.read_csv(test_data_file)
column_name = "real_date"
# Parse the timestamps and drop rows that fail to parse
df[column_name] = pd.to_datetime(df[column_name], errors='coerce')
df.dropna(inplace=True, subset=[column_name])
# Difference from the previous row, then the tolerance check
df['time_difference'] = df[column_name].diff().dt.seconds
df['test_gap'] = (df[column_name].diff().dt.seconds) <= maximum_tolerated_gap_in_seconds
df.head()
Output:
   line_number  line_length      syslog_date  unix_timestamp            real_date  time_difference  test_gap
0         4026          110  Jul 19 01:00:01    1.626656e+09  2021-07-19 01:00:01              NaN     False
1         4027           97  Jul 19 01:00:02    1.626656e+09  2021-07-19 01:00:02              1.0      True
2         4028          110  Jul 19 01:00:01    1.626656e+09  2021-07-19 01:00:01          86399.0     False
3         4029          110  Jul 19 01:00:01    1.626656e+09  2021-07-19 01:00:01              0.0      True
4         4030          110  Jul 19 01:00:01    1.626656e+09  2021-07-19 01:30:00           1799.0     False
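So when the timestamp on line 4028 is 1 second earlier than the one on the preceding line 4027, the computed difference is 86399.0, which is greater than maximum_tolerated_gap_in_seconds.
My question: when the difference is really only 1 second, but the out-of-order timestamps make it show up as 86399 seconds, how do I ignore it?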
Edit 2:
Here is the code updated with adr's suggested change to use dt.total_seconds():
import pandas as pd
import io
maximum_tolerated_gap_in_seconds = 120
gaps_to_ignore = 1
seconds_per_day = 86400
test_data = """"line_number","line_length","syslog_date","unix_timestamp","real_date"
4026,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:00:01"
4027,97,"Jul 19 01:00:02",1626656402.0,"2021-07-19T01:00:02"
4028,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:00:01"
4029,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:00:01"
4030,110,"Jul 19 01:00:01",1626656401.0,"2021-07-19T01:30:00"
"""
test_data_file = io.StringIO(test_data)
df = pd.read_csv(test_data_file)
column_name = "real_date"
df[column_name] = pd.to_datetime(df[column_name], errors='coerce')
df.dropna(inplace=True, subset=[column_name])
df['time_difference'] = df[column_name].diff().dt.total_seconds()
df['test_gap'] = (df[column_name].diff().dt.total_seconds()) <= maximum_tolerated_gap_in_seconds
print(df.head())
And the output:
$ python sample.py
   line_number  line_length      syslog_date  unix_timestamp            real_date  time_difference  test_gap
0         4026          110  Jul 19 01:00:01    1.626656e+09  2021-07-19 01:00:01              NaN     False
1         4027           97  Jul 19 01:00:02    1.626656e+09  2021-07-19 01:00:02              1.0      True
2         4028          110  Jul 19 01:00:01    1.626656e+09  2021-07-19 01:00:01             -1.0      True
3         4029          110  Jul 19 01:00:01    1.626656e+09  2021-07-19 01:00:01              0.0      True
4         4030          110  Jul 19 01:00:01    1.626656e+09  2021-07-19 01:30:00           1799.0     False
That works.
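The dt.seconds accessor always gives you a non-negative value for the time: it is only the seconds component of the timedelta, in the range 0 to 86399. To represent a negative difference, the days component goes negative instead, so a difference of -1 second is stored as -1 days plus 86399 seconds. Try dt.total_seconds() instead.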
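To make this concrete, here is a minimal sketch that reproduces the two out-of-order timestamps from lines 4027 and 4028 and prints the components of their difference. The second part is only a suggestion and not something from the question: within_tolerance and its max_backstep parameter are hypothetical names, and using Series.between to accept small backward steps is one possible way to apply the unused gaps_to_ignore idea.

```python
import pandas as pd

# Two timestamps one second apart, with the later one listed first,
# mirroring lines 4027 and 4028 of the sample data.
stamps = pd.Series(pd.to_datetime(["2021-07-19T01:00:02", "2021-07-19T01:00:01"]))
delta = stamps.diff()

# Components of the -1 second difference: the days part goes negative,
# while the seconds part stays within 0..86399.
days_part = delta.dt.days.iloc[1]
seconds_part = delta.dt.seconds.iloc[1]
total = delta.dt.total_seconds().iloc[1]
print(days_part, seconds_part, total)


def within_tolerance(diff_seconds, max_gap=120, max_backstep=1):
    """Hypothetical helper: True where the signed difference lies in
    [-max_backstep, max_gap]. NaN (the first row) evaluates to False."""
    return diff_seconds.between(-max_backstep, max_gap)


# Signed differences in seconds, including a 5-second backward jump.
sample = pd.Series([float("nan"), 1.0, -1.0, 0.0, 1799.0, -5.0])
print(within_tolerance(sample).tolist())
```

With a plain `<= maximum_tolerated_gap_in_seconds` test on total_seconds(), any backward jump passes because it is negative. Bounding the check on both sides, as in the sketch, accepts a 1-second reordering but still flags larger backward jumps; whether that is desirable depends on how noisy the syslog timestamps are.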