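I'm still fairly new to Python, and I've found answers to most of my questions here, but this one has me stumped.
I'm using Python to process log files, and typically each line starts with a date/time stamp, for example: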
[1/4/13 18:37:37:848 PST]
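99% of the time I can read the file line by line, look for items of interest, and process them accordingly. Occasionally, though, an entry in the log file contains a message with carriage return/line feed characters in it, so it spans several lines.
Is there an easy way to read the file "between timestamps", so that when this happens the multiple lines are merged and read as a single entry? For example: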
[1/4/13 18:37:37:848 PST] A log entry
[1/4/13 18:37:37:848 PST] Another log entry
[1/4/13 18:37:37:848 PST] A log entry that somehow
got some new line
characters mixed in
[1/4/13 18:37:37:848 PST] The last log entry
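would be read as four lines, rather than the six I get now.
Thanks in advance for any help.
Chris
Update:
myTestFile.log contains the exact text above, and this is my script: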
import sys, getopt, os, re

sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/myTestFile.log"
lines = []

def timestamp_split(file):
    pattern = re.compile(r"\[(0?[1-9]|[12][0-9]|3[01])(/)(0?[1-9]|[12][0-9]|3[01])(/)([0-9]{2})( )")
    current = []
    for line in file:
        if not re.match(pattern, line):
            if current:
                yield "".join(current)
            current == [line]
        else:
            current.append(line)
    yield "".join(current)

print("--- START ----")
with open(logFileName) as file:
    for entry in timestamp_split(file):
        print(entry)
        print("- Record Separator -")
print("--- DONE ----")
When I run it, I get this:
--- START ----
[1/4/13 18:37:37:848 PST] A log entry
[1/4/13 18:37:37:848 PST] Another log entry
[1/4/13 18:37:37:848 PST] A log entry that somehow
- Record Separator -
[1/4/13 18:37:37:848 PST] A log entry
[1/4/13 18:37:37:848 PST] Another log entry
[1/4/13 18:37:37:848 PST] A log entry that somehow
- Record Separator -
[1/4/13 18:37:37:848 PST] A log entry
[1/4/13 18:37:37:848 PST] Another log entry
[1/4/13 18:37:37:848 PST] A log entry that somehow
[1/4/13 18:37:37:848 PST] The last log entry
- Record Separator -
--- DONE ----
I seem to be repeating entries too many times. What I was expecting (hoping for) is:
--- START ----
[1/4/13 18:37:37:848 PST] A log entry
- Record Separator -
[1/4/13 18:37:37:848 PST] Another log entry
- Record Separator -
[1/4/13 18:37:37:848 PST] A log entry that somehow got some new line characters mixed in
- Record Separator -
[1/4/13 18:37:37:848 PST] The last log entry
- Record Separator -
--- DONE ----
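As discussed in the comments, I had accidentally left a "not" in the comparison against the regex pattern while testing. If I remove it, I get only the partial lines back, which confuses me even more!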
--- START ----
got some new line
characters mixed in
- Record Separator -
got some new line
characters mixed in
- Record Separator -
--- DONE ----
The easiest way to do this is to implement a simple generator:
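def timestamp_split(file):
    current = []
    for line in file:
        if line.startswith("["):
            # A new entry begins: emit the one collected so far, if any.
            if current:
                yield "".join(current)
            current = [line]
        else:
            # Continuation line: attach it to the current entry.
            current.append(line)
    # Emit the final entry (skipped if the file was empty).
    if current:
        yield "".join(current)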
Of course, this assumes that a "[" at the start of a line is enough to signify a timestamp. You may want a more substantial check.
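One way to make that check stricter, as a rough sketch: test the start of each line against a regular expression modelled on the sample timestamp [1/4/13 18:37:37:848 PST]. The pattern and the name TIMESTAMP are assumptions based on that single example, so adjust them to whatever your real logs contain.

```python
import re

# Assumed format, taken from the sample line "[1/4/13 18:37:37:848 PST]".
TIMESTAMP = re.compile(
    r"\[\d{1,2}/\d{1,2}/\d{2} \d{1,2}:\d{2}:\d{2}:\d{3} [A-Z]+\]"
)


def timestamp_split(lines):
    """Yield one string per log entry; an entry starts at a timestamped line."""
    current = []
    for line in lines:
        if TIMESTAMP.match(line):
            # New entry: emit whatever was collected for the previous one.
            if current:
                yield "".join(current)
            current = [line]
        else:
            # No timestamp at the start, so treat it as a continuation.
            current.append(line)
    if current:
        yield "".join(current)
```

With this version, a continuation line that merely happens to begin with "[" stays attached to the entry it belongs to.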
Then use it something like this:
with open("somefile.txt") as file:
    for entry in timestamp_split(file):
        ...
(Using the with statement here is good practice for opening files.)
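If you want each entry on a single line, as in the expected output in the question, one option is to collapse the embedded line breaks once the entries have been grouped. A minimal sketch: one_line is a hypothetical helper name, and replacing every run of whitespace with a single space is an assumption about what you want.

```python
def one_line(entry):
    """Collapse all runs of whitespace, including newlines, into single spaces."""
    return " ".join(entry.split())


entry = (
    "[1/4/13 18:37:37:848 PST] A log entry that somehow\n"
    "got some new line\n"
    "characters mixed in\n"
)
print(one_line(entry))
# [1/4/13 18:37:37:848 PST] A log entry that somehow got some new line characters mixed in
```

Note that this also squeezes repeated spaces inside a message, which may or may not matter for your logs.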
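import re

lines = []
# Matches timestamps such as [1/4/13 18:37:37:848 PST]
pattern = re.compile(r'\[\d+/\d+/\d+\s\d+:\d+:\d+:\d+\s\w+\]')
with open('filename.txt', 'r') as f:
    for line in f:
        if re.match(pattern, line):
            # Timestamped line: start a new entry.
            lines.append(line)
        else:
            # Continuation line: append it to the previous entry.
            lines[-1] += line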
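This matches the timestamp with a regular expression, which can be adjusted as needed. It also assumes that the first line contains a timestamp.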
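If the first line might not carry a timestamp, lines[-1] raises an IndexError because the list is still empty. A hedged variant that avoids this is sketched below. It keeps any leading untimestamped lines as an entry of their own, which is an assumed policy (you may prefer to discard them). The name merge_entries is made up for illustration, and the pattern is written for the sample timestamp format only.

```python
import re

# Assumed to match timestamps like "[1/4/13 18:37:37:848 PST]".
pattern = re.compile(r"\[\d+/\d+/\d+\s\d+:\d+:\d+:\d+\s\w+\]")


def merge_entries(f):
    """Return a list of entries, appending continuation lines to the previous entry."""
    lines = []
    for line in f:
        if pattern.match(line) or not lines:
            # Start a new entry on a timestamp, or when there is nothing to append to yet.
            lines.append(line)
        else:
            lines[-1] += line
    return lines
```

It accepts any iterable of lines, so an open file object works as well as a list of strings.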