Optimizing Python code for reading a file

I have the following code. Code 1:

logfile = open(logfile, 'r')
logdata = logfile.read()
logfile.close()
CurBeginA = BeginSearchDVar
CurEndinA = EndinSearchDVar
matchesBegin = re.search(str(BeginTimeFirstEpoch), logdata)
matchesEnd = re.search(str(EndinTimeFirstEpoch), logdata)
BeginSearchDVar = BeginTimeFirstEpoch
EndinSearchDVar = EndinTimeFirstEpoch

I also have this code in another part of the script. Code 2:

TheTimeStamps = [ x.split(' ')[0][1:-1] for x in open(logfile).readlines() ]

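Obviously, I'm loading the log file twice, and I'd like to avoid that. Is there any way to do what Code 2 does inside Code 1, so that the log file is loaded only once?

In Code 1, I search the log to make sure two very specific patterns are each found, on different lines.

In Code 2, I just pull the first column from every line of the log file.

How can this be optimized? I'm currently running it on a 480MB log file, and the script finishes in about 12 seconds. Since this log can grow to 1GB or even 2GB, I want to make it as efficient as possible.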

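Update:

So, @abernert's code works. I went ahead and added one more piece of logic to it, and now it no longer works. Below is the modified code I have now. What I'm basically trying to do is this: if the patterns in matchesBegin and matchesEnd are both found in the log, then search the log from matchesBegin to matchesEnd and print only the lines that contain both stringA and stringB: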

matchesBegin, matchesEnd = None, None
beginStr, endStr = str(BeginTimeFirstEpoch).encode(), str(EndinTimeFirstEpoch).encode()
AllTimeStamps = []
mylist = []
with open(logfile, 'rb') as input_data:
    def SearchFirst():
        matchesBegin, matchesEnd = None, None
        for line in input_data:
            if not matchesBegin:
                matchesBegin = beginStr in line
            if not matchesEnd:
                matchesEnd = endStr in line
        return(matchesBegin, matchesEnd)
    matchesBegin, matchesEndin = SearchFirst()
    #print type(matchesBegin)
    #print type(matchesEndin)
    #if str(matchesBegin) == "True" and str(matchesEnd) == "True":
    if matchesBegin is True and matchesEndin is True:
        rangelines = 0
        for line in input_data:
            print line
            if beginStr in line[0:25]:  # Or whatever test is needed
                rangelines += 1
                #print line.strip()
                if re.search(stringA, line) and re.search(stringB, line):
                    mylist.append((line.strip()))
                break
        for line in input_data:  # This keeps reading the file
            print line
            if endStr in line[0:25]:
                rangelines += 1
                if re.search(stringA, line) and re.search(stringB, line):
                    mylist.append((line.strip()))
                break
            if re.search(stringA, line) and re.search(stringB, line):
                rangelines += 1
                mylist.append((line.strip()))
            else:
                rangelines += 1
        #return(mylist,rangelines)
        print(mylist,rangelines)
    AllTimeStamps.append(line.split(' ')[0][1:-1])

What am I doing wrong in the code above?
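One thing worth checking in the updated code, offered as a hedged observation rather than a confirmed diagnosis: a file object is an iterator that can be consumed only once. SearchFirst() loops over input_data until the end of the file, so the two later `for line in input_data` loops have nothing left to read unless the file is rewound with seek(0). The sketch below reproduces the effect with an in-memory file; the sample lines are made up for illustration.

```python
import io

# Hypothetical in-memory stand-in for the log file (sample data is made up).
input_data = io.BytesIO(b"[100] start\n[200] middle\n[300] end\n")

first_pass = [line for line in input_data]   # consumes the whole file
second_pass = [line for line in input_data]  # nothing left to read

input_data.seek(0)                           # rewind to the beginning
third_pass = [line for line in input_data]   # lines are available again

print(len(first_pass), len(second_pass), len(third_pass))
```

If this is the cause, rewinding after the first search (or folding both steps into one loop) should make the range loops see data again.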

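First, there is almost never a good reason to call readlines(). A file is already an iterable of lines, so you can just loop over the file itself; reading all of those lines into memory and building a giant list only wastes time and memory.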
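As a small illustration of that point, Code 2 can collect the same timestamps by iterating over the file object directly instead of calling readlines(). This is a sketch that assumes each line starts with a bracketed timestamp followed by a space, which is what the original slicing implies; the helper name first_column and the sample lines are hypothetical.

```python
import os
import tempfile

def first_column(path):
    """Return the first space-separated field of each line, minus its first and last character."""
    with open(path) as f:  # the with block closes the file for us
        # Iterating over f yields one line at a time, with no intermediate list of all lines.
        return [line.split(' ')[0][1:-1] for line in f]

# Made-up sample log, written to a temporary file just for the demonstration.
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as tmp:
    tmp.write("[1500000000] service started\n[1500000060] service stopped\n")

timestamps = first_column(tmp.name)
os.remove(tmp.name)
print(timestamps)
```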

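On the other hand, calling read() is sometimes useful. It does have to read the whole file into memory as one giant string, but a regex search over one giant string can be so much faster than searching line by line that the wasted time and space are more than compensated for.

However, if you want to reduce this to a single pass over the file, then since you already have to iterate line by line, there is really no alternative but to do the regex searches line by line as well. This should work (you haven't shown your patterns, but based on the names I'm guessing they aren't supposed to cross line boundaries and aren't multiline or dotall patterns), but whether it's actually faster or slower will depend on a variety of factors.

Anyway, it's certainly worth a try to see whether it helps. (And, while we're at it, I'll use a with statement to make sure you close the file, instead of leaking it as you do in the second part.)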

CurBeginA = BeginSearchDVar
CurEndinA = EndinSearchDVar
BeginSearchDVar = BeginTimeFirstEpoch
EndinSearchDVar = EndinTimeFirstEpoch
matchesBegin, matchesEnd = None, None
TheTimeStamps = []
with open(logfile) as f:
    for line in f:
        # Stop searching for each pattern once it has been found.
        if not matchesBegin:
            matchesBegin = re.search(str(BeginTimeFirstEpoch), line)
        if not matchesEnd:
            matchesEnd = re.search(str(EndinTimeFirstEpoch), line)
        # Code 2's work, done in the same pass.
        TheTimeStamps.append(line.split(' ')[0][1:-1])
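For experimenting without the real log, here is a self-contained sketch of the same single-pass idea wrapped in a function. The function name scan_log, its parameters, and the sample data are hypothetical; io.StringIO stands in for the open log file.

```python
import io
import re

def scan_log(lines, begin_pattern, end_pattern):
    """One pass over lines: note whether each pattern was seen and collect timestamps."""
    found_begin = found_end = False
    timestamps = []
    for line in lines:
        if not found_begin:
            found_begin = re.search(begin_pattern, line) is not None
        if not found_end:
            found_end = re.search(end_pattern, line) is not None
        timestamps.append(line.split(' ')[0][1:-1])
    return found_begin, found_end, timestamps

# Made-up sample data in place of the real file.
sample = io.StringIO("[100] boot\n[200] running\n[300] shutdown\n")
found_begin, found_end, timestamps = scan_log(sample, "100", "300")
print(found_begin, found_end, timestamps)
```

Passing any iterable of lines keeps the function easy to time against the two-pass version on a real file.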

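There are a few other small changes you could make here that might help.

I don't know what BeginTimeFirstEpoch is, but the fact that you're calling str(BeginTimeFirstEpoch) implies that it isn't a regex pattern at all, but something like a datetime object or an int. And you don't really need the match object; you only need to know whether there was a match. If so, you can drop the regex and do a plain substring search, which will be a bit faster: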

matchesBegin, matchesEnd = None, None
beginStr, endStr = str(BeginTimeFirstEpoch), str(EndinTimeFirstEpoch)
with …
        # …
        if not matchesBegin:
            matchesBegin = beginStr in line
        if not matchesEnd:
            matchesEnd = endStr in line
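To make the difference concrete, the sketch below compares the two tests on made-up lines. It also shows that the two are only equivalent when the search text contains no regex metacharacters: with re.search, a `.` matches any character unless the text is passed through re.escape. All values here are hypothetical.

```python
import re

line = "[1500000000.25] service started"
needle = str(1500000000.25)  # '1500000000.25'

regex_hit = re.search(needle, line) is not None
substring_hit = needle in line

# A line where the position of the '.' holds a different character.
other = "[1500000000x25] service started"
regex_false_positive = re.search(needle, other) is not None   # '.' matches 'x'
escaped_hit = re.search(re.escape(needle), other) is not None  # literal match only
substring_other = needle in other

print(regex_hit, substring_hit, regex_false_positive, escaped_hit, substring_other)
```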

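If your search strings, timestamps, and so on are all pure ASCII, it may be faster to process the file in binary mode, decoding only the bits you need to store rather than everything: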

matchesBegin, matchesEnd = None, None
beginStr, endStr = str(BeginTimeFirstEpoch).encode(), str(EndinTimeFirstEpoch).encode()
with open(logfile, 'rb') as f:
    # …
        if not matchesBegin:
            matchesBegin = beginStr in line
        if not matchesEnd:
            matchesEnd = endStr in line
        TheTimeStamps.append(line.split(b' ')[0][1:-1].decode())
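One detail that is easy to trip over in binary mode on Python 3: each line is a bytes object, so the search text and the split separator must be bytes too, and mixing in a str raises TypeError. The sketch below shows this on made-up data, with io.BytesIO standing in for a file opened with 'rb'; the variable names are hypothetical.

```python
import io

# Made-up ASCII log data in place of a file opened in binary mode.
sample = io.BytesIO(b"[100] boot\n[200] running\n")
begin_bytes = str(100).encode()  # b'100'

timestamps = []
found = False
mixing_raises = False
for line in sample:              # each line is bytes, not str
    if not found:
        found = begin_bytes in line
    try:
        "100" in line            # str needle against a bytes line
    except TypeError:
        mixing_raises = True
    # Decode only the small slice that is actually stored.
    timestamps.append(line.split(b' ')[0][1:-1].decode())

print(found, mixing_raises, timestamps)
```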

Finally, I doubt that str.split is anywhere near a bottleneck in your code, but just in case... why split on every space when we only want the first field?

TheTimeStamps.append(line.split(b' ', 1)[0][1:-1].decode())
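A quick check of what the maxsplit argument changes, using a made-up log line: both calls produce the same first field, but the second stops after the first space instead of splitting the rest of the line.

```python
# Hypothetical log line for illustration.
line = b"[1500000000] GET /index.html 200 512"

all_fields = line.split(b' ')     # splits at every space
first_only = line.split(b' ', 1)  # stops after the first space

timestamp_a = all_fields[0][1:-1].decode()
timestamp_b = first_only[0][1:-1].decode()
print(len(all_fields), len(first_only), timestamp_a == timestamp_b)
```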
