Searching a haystack for several equal-length needles (Python)



I'm looking for a way to search a large string for a large number of equal-length substrings.

My current approach is basically this:

offset = 0
found = []
while offset < len(haystack):
    current_chunk = haystack[offset*8:offset*8+8]
    if current_chunk in needles:
        found.append(current_chunk)
    offset += 1

This is painfully slow. Is there a better, more Pythonic way?

More Pythonic, and faster:

found = []
for needle in needles:
    if needle in haystack:
        found.append(needle)

Edit: after some limited testing, here are the results:

This algorithm: 0.000135183334351

Your algorithm: 0.984048128128

So this one is faster.
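
One thing to keep in mind when comparing the two: the per-needle loop above reports a needle found at any offset, while the loop in the question only looks at blocks that start on a multiple of 8. If aligned matches are what you actually need, a rough sketch like the following keeps that behaviour and turns `needles` into a set so each lookup is a hash test rather than a list scan. The names `find_aligned`, `find_anywhere` and `width` are made up for illustration, and the sample data is arbitrary.

```python
def find_aligned(haystack, needles, width=8):
    """Return the needles found at offsets that are multiples of width."""
    needle_set = set(needles)  # average O(1) membership test per block
    found = []
    for start in range(0, len(haystack), width):
        block = haystack[start:start + width]
        if block in needle_set:
            found.append(block)
    return found


def find_anywhere(haystack, needles):
    """Return the needles that occur at any offset (the per-needle loop)."""
    return [needle for needle in needles if needle in haystack]


# Arbitrary sample data: three 8-character blocks
haystack = "AAAAAAAABBBBBBBBCCCCCCCC"
needles = ["BBBBBBBB", "AABBBBBB", "ZZZZZZZZ"]

print(find_aligned(haystack, needles))   # only the block at offset 8 matches
print(find_anywhere(haystack, needles))  # also reports the unaligned "AABBBBBB"
```

Which of the two is right depends on whether an unaligned hit such as "AABBBBBB" should count as a match.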

I think you could split the work across multiple cores and parallelize your search. Something like the following:

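from functools import partial
from multiprocessing import Pool

NEEDLE_LEN = 8  # every needle is 8 characters long


def chunks(l, n):
    """A generator function for chopping up a given sequence into chunks of length n."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


def searchHaystack(haystack, needles):
    """Return the aligned 8-character blocks of haystack that are in needles."""
    found = []
    for offset in range(0, len(haystack), NEEDLE_LEN):
        current_chunk = haystack[offset:offset + NEEDLE_LEN]
        if current_chunk in needles:
            found.append(current_chunk)
    return found


if __name__ == "__main__":
    text = "Your very long string"
    needles = {"needle01", "needle02"}  # placeholder values; a set keeps lookups fast

    # Build a pool of 8 processes
    with Pool(processes=8) as pool:
        # Fragment the string data into about 8 parts, each a multiple of
        # NEEDLE_LEN long so that no block is split across two parts
        blocks_per_part = max(1, -(-len(text) // (8 * NEEDLE_LEN)))
        partitioned_text = list(chunks(text, blocks_per_part * NEEDLE_LEN))
        # Search every part in parallel; each worker returns its own list
        results = pool.map(partial(searchHaystack, needles=needles), partitioned_text)

    # Generate all the needles found by flattening the per-part lists
    all_the_needles = [needle for part in results for needle in part]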
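
One detail worth checking in any split like this: if each worker looks at 8-character blocks, every part has to start on a block boundary, otherwise the blocks inside that part are shifted and matches get missed. Below is a small sketch of one way to cut the text so that every part length is a multiple of the block width. `aligned_parts` and its parameters are hypothetical names, not something from the `multiprocessing` module.

```python
def aligned_parts(text, n_parts, width=8):
    """Split text into at most n_parts slices whose lengths are multiples of width."""
    n_blocks = -(-len(text) // width)                  # ceiling division
    blocks_per_part = max(1, -(-n_blocks // n_parts))  # blocks handed to each worker
    step = blocks_per_part * width
    return [text[i:i + step] for i in range(0, len(text), step)]


# Ten 8-character blocks: "00000000", "00000001", ...
text = "".join("%08d" % i for i in range(10))
parts = aligned_parts(text, 4)

print([len(p) for p in parts])   # part lengths, each a multiple of 8
print("".join(parts) == text)    # the parts cover the text exactly once
```

With these 80 characters and 4 parts, a plain `len(text) // 4` would give 20-character slices, so the second slice would begin in the middle of a block.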
