I'm looking for a way to search a large string for a large number of substrings of equal length.
My current approach is basically this:
offset = 0
found = []
while offset * 8 < len(haystack):
    current_chunk = haystack[offset*8:offset*8+8]
    if current_chunk in needles:
        found.append(current_chunk)
    offset += 1
This is painfully slow. Is there a better, more Pythonic way to do it?
More Pythonic, and faster:
for needle in needles:
    if needle in haystack:
        found.append(needle)
Edit: after some limited testing, here are the results:

This algorithm: 0.000135183334351
Your algorithm: 0.984048128128

Much faster.
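For context, here is a minimal sketch of how such a comparison could be run with `timeit`; the haystack and needles below are made-up placeholder data, not the asker's real input. Note the two approaches also report results differently: the chunk loop records every matching aligned chunk (duplicates included), while the needle loop records each needle at most once and will also match needles that are not aligned to an 8-character boundary.

```python
import timeit

# Placeholder data for illustration only: an 80,000-character haystack
# made of aligned 8-character chunks, and two candidate needles.
haystack = "abcdefgh" * 10_000
needles = ["abcdefgh", "zzzzzzzz"]

def chunk_scan(haystack, needles):
    # The question's approach: test every aligned 8-character chunk.
    found = []
    offset = 0
    while offset * 8 < len(haystack):
        chunk = haystack[offset*8:offset*8+8]
        if chunk in needles:
            found.append(chunk)
        offset += 1
    return found

def needle_scan(haystack, needles):
    # The answer's approach: one substring search per needle.
    return [n for n in needles if n in haystack]

print("needle loop:", timeit.timeit(lambda: needle_scan(haystack, needles), number=10))
print("chunk loop: ", timeit.timeit(lambda: chunk_scan(haystack, needles), number=10))
```

The gap is unsurprising: the needle loop does one C-level substring search per needle, while the chunk loop runs a Python-level iteration (plus a linear `in needles` list scan) for every chunk of the haystack.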
I think you could split this across multiple cores and parallelize your search. Something like the following:
from functools import partial
from multiprocessing import Pool

text = "Your very long string"
# needles: your collection of 8-character substrings, defined elsewhere

def chunks(l, n):
    """
    A generator function for chopping up a given list into chunks of
    length n.
    """
    for i in range(0, len(l), n):
        yield l[i:i+n]

def searchHaystack(haystack, needles):
    offset = 0
    found = []
    while offset * 8 < len(haystack):
        current_chunk = haystack[offset*8:offset*8+8]
        if current_chunk in needles:
            found.append(current_chunk)
        offset += 1
    return found

if __name__ == "__main__":
    # Build a pool of 8 processes
    pool = Pool(processes=8)
    # Fragment the string data into 8 chunks
    # (note: needles straddling a chunk boundary will be missed)
    partitioned_text = list(chunks(text, len(text) // 8))
    # Search the chunks in parallel; partial binds the needles argument
    all_the_needles = pool.map(partial(searchHaystack, needles=needles),
                               partitioned_text)