Here is my example so far:
import io
import mmap
import os
import time

# 20 MB test file
filename = "random.bin"
if not os.path.isfile(filename):
    with open(filename, "wb") as f:
        for _ in range(20):
            f.write(os.urandom(1_000_000))

signature = b"\x01\x02\x03\x04"

print("Method 1:")
start_time = time.time()
offsets = []
with open(filename, "rb") as f:
    buf = b"\x00" + f.read(len(signature) - 1)
    for offset, byte in enumerate(iter(lambda: f.read(1), b"")):
        buf = buf[1:] + byte
        if buf == signature:
            offsets.append(offset)
print(f"{time.time() - start_time:.2f} seconds")
print(offsets)

print("Method 2:")
start_time = time.time()
offsets = []
with open(filename, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    for offset in range(len(mm)):
        if mm[offset : offset + len(signature)] == signature:
            offsets.append(offset)
print(f"{time.time() - start_time:.2f} seconds")
print(offsets)

print("Method 3:")
start_time = time.time()
offsets = []
with open(filename, "rb") as f:
    offset = 0
    chunk = f.read(len(signature) - 1)
    while True:
        chunk = chunk[-(len(signature) - 1) :] + f.read1(io.DEFAULT_BUFFER_SIZE)
        if len(chunk) < len(signature):
            # EOF
            break
        for i in range(len(chunk) - (len(signature) - 1)):
            if chunk[i : i + len(signature)] == signature:
                offsets.append(offset + i)
        offset += len(chunk) - (len(signature) - 1)
print(f"{time.time() - start_time:.2f} seconds")
print(offsets)
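
I am searching a 20 MB test file for the 4-byte signature. Switching to Method 2, which uses mmap.mmap, saved 50% of the run time, but it is still slow, especially since my real target files will be between 1 and 10 GB. (That is why I do not load the whole file into memory first.) It is orders of magnitude slower than md5sum random.bin.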

Edit: I added another method that does not use f.read(1) but reads chunks of io.DEFAULT_BUFFER_SIZE, and also uses read1 to prevent any blocking. It is still not faster.
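
Reading the file is not the problem. Using chunks instead of read(1) is certainly fast enough. Iterating over the bytes with slices afterwards, however, is clearly a bad idea. bytes.find is much faster. I do not understand why yet, but I have posted a new question, "How is str.find so fast?".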
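
To see the difference in isolation, the following sketch times both scanning strategies on an in-memory buffer, with no file I/O involved. The helper names find_by_slicing and find_by_find, the 1 MB buffer size and the planted offsets are illustrative assumptions, and the absolute numbers will vary by machine.

```python
import os
import time

signature = b"\x01\x02\x03\x04"


def find_by_slicing(data, needle):
    """Compare a slice at every offset (the approach of Methods 2 and 3)."""
    n = len(needle)
    return [i for i in range(len(data) - n + 1) if data[i : i + n] == needle]


def find_by_find(data, needle):
    """Let bytes.find do the scanning and loop only once per match."""
    offsets = []
    position = data.find(needle)
    while position >= 0:
        offsets.append(position)
        position = data.find(needle, position + 1)
    return offsets


# 1 MB of random data with the signature planted at two known offsets.
data = bytearray(os.urandom(1_000_000))
for planted in (1234, 999_000):
    data[planted : planted + len(signature)] = signature
data = bytes(data)

for func in (find_by_slicing, find_by_find):
    start_time = time.perf_counter()
    result = func(data, signature)
    elapsed = time.perf_counter() - start_time
    print(f"{func.__name__}: {elapsed:.4f} seconds, {len(result)} matches")
```

In CPython, the slicing version executes interpreter bytecode and allocates a new bytes object for every single offset, whereas bytes.find performs the whole scan inside the interpreter's C implementation. That is a likely explanation for most of the gap, although this sketch does not measure the two effects separately.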

Using bytes.find:

print("Method 4:")
start_time = time.time()
offsets = []


def _find(haystack, needle, start=0, offset=0):
    while True:
        position = haystack.find(needle, start)
        if position < 0:
            return
        start = position + 1
        yield position + offset


with open(filename, "rb") as f:
    offset = 0
    chunk = f.read(len(signature) - 1)
    while True:
        chunk = chunk[-(len(signature) - 1) :] + f.read1(io.DEFAULT_BUFFER_SIZE)
        if len(chunk) < len(signature):
            # EOF
            break
        offsets.extend(_find(chunk, signature, offset=offset))
        offset += len(chunk) - (len(signature) - 1)
print(f"{time.time() - start_time:.2f} seconds")
print(offsets)
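
The part of Method 4 that is easiest to get wrong is the overlap between consecutive chunks. As a sketch, the same idea can be wrapped in a generator and checked against a small in-memory stream in which the signature deliberately straddles a read boundary. The name find_all_chunked, the chunk_size parameter and the test data are assumptions made for illustration, and the sketch uses plain read instead of read1 to keep it short.

```python
import io


def find_all_chunked(stream, needle, chunk_size=io.DEFAULT_BUFFER_SIZE):
    """Yield absolute offsets of needle in a binary stream, read in chunks."""
    overlap = len(needle) - 1
    offset = 0  # absolute stream position of chunk[0]
    tail = b""
    while True:
        block = stream.read(chunk_size)
        if not block:
            return  # EOF
        chunk = tail + block
        position = chunk.find(needle)
        while position >= 0:
            yield offset + position
            position = chunk.find(needle, position + 1)
        # Keep the last len(needle) - 1 bytes so that a match straddling
        # two reads is still seen in the next iteration. The tail is
        # shorter than the needle, so no match is reported twice.
        tail = chunk[-overlap:] if overlap else b""
        offset += len(chunk) - len(tail)


signature = b"\x01\x02\x03\x04"
# Signatures at offsets 6 and 15; with chunk_size=4 both cross a read boundary.
data = b"\xff" * 6 + signature + b"\xff" * 5 + signature
print(list(find_all_chunked(io.BytesIO(data), signature, chunk_size=4)))  # [6, 15]
```

Because the generator only needs an object with a read method, the same function should work on a file opened with open(filename, "rb").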

Running all four methods produces the following timings:
Method 1:
8.83 seconds
[20971596, 20971686]
Method 2:
5.69 seconds
[20971596, 20971686]
Method 3:
5.76 seconds
[20971596, 20971686]
Method 4:
0.02 seconds
[20971596, 20971686]
Method 4 is clearly the winner. 👍
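
mmap objects also provide a find method, so the memory-mapped approach of Method 2 could be combined with the search strategy of Method 4 without any chunk bookkeeping. The sketch below was not part of the measurements above, so no timing is claimed for it. The name find_all_mmap and the sample file are illustrative, and it assumes a 64-bit Python build, since files of several gigabytes cannot be mapped into a 32-bit address space.

```python
import mmap
import os
import tempfile


def find_all_mmap(path, needle):
    """Return every offset of needle in the file at path, using mmap.find."""
    offsets = []
    # Note: mmap raises ValueError for an empty file.
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        position = mm.find(needle)
        while position >= 0:
            offsets.append(position)
            position = mm.find(needle, position + 1)
    return offsets


signature = b"\x01\x02\x03\x04"

# Small throwaway file with the signature at offsets 10 and 34.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    with open(path, "wb") as f:
        f.write(b"\xff" * 10 + signature + b"\xff" * 20 + signature)
    print(find_all_mmap(path, signature))  # [10, 34]
```

Whether this is faster or slower than Method 4 on multi-gigabyte files would need to be measured on the actual target data.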