Finding a list of words in a string in parallel with Python

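I know this question has been answered several times in different places, but I am struggling to find a way to do it in parallel. I took the answer by @Aaron Hall from "Python: how to determine if a list of words exist in a string". It works well, but when I run the same code in parallel with ProcessPoolExecutor or ThreadPoolExecutor, it is very slow. Normal execution takes 0.22 seconds to process 119288 lines, but with ProcessPoolExecutor it takes 93 seconds. I don't understand the problem; the code snippet is below.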

from concurrent.futures import ProcessPoolExecutor

def multi_thread_execute():  # this takes 93 seconds
    lines = get_lines()
    print("got {} lines".format(len(lines)))
    futures = []
    my_word_list = ['banking', 'members', 'based', 'hardness']
    with ProcessPoolExecutor(max_workers=10) as pe:
        for line in lines:
            # one task is submitted per line
            ff = pe.submit(words_in_string, my_word_list, line)
            futures.append(ff)
        # collect the results in submission order
        results = [f.result() for f in futures]

The single-threaded version takes 0.22 seconds.

my_word_list = ['banking', 'members', 'based', 'hardness']
lines = get_lines()
for line in lines:
    result = words_in_string(my_word_list, line)

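I have single files of 50 GB+ each (Google 5-gram files). Reading the lines in parallel works fine, but the multithreaded version above is too slow. Is this a GIL problem? How can I improve performance?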

Sample file format (a single file is 50+ GB, the total data is 3 TB):

n.p. : The Author , 2005    1   1
n.p. : The Author , 2006    7   2
n.p. : The Author , 2007    1   1
n.p. : The Author , 2008    2   2
NP if and only if   1977    1   1
NP if and only if   1980    1   1
NP if and only if   1982    3   2

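Python is a language that generally does not have strong use cases for multithreading, particularly for CPU-bound work like this. You can read more about why in this StackOverflow question.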
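As for the ProcessPoolExecutor version, a likely cause of the slowdown is per-task overhead: every `submit` call pickles its arguments, ships them to a worker process, and pickles the result back, which can easily cost more than the tiny amount of work done on a single line. One common way to reduce this is to submit batches of lines instead of single lines. The following is a minimal sketch under stated assumptions, not a tuned solution: `words_in_string` is assumed to behave like the set-intersection helper from the referenced answer, and `match_chunk`, `chunked`, `parallel_match`, and the default `chunk_size` are hypothetical names and values introduced here for illustration.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice


def words_in_string(word_list, a_string):
    # Assumed helper: the words from word_list that occur in a_string
    # as whitespace-separated tokens.
    return set(word_list).intersection(a_string.split())


def match_chunk(word_list, chunk):
    # Runs inside a worker process and handles a whole batch of lines,
    # so the pickling cost is paid once per batch, not once per line.
    return [words_in_string(word_list, line) for line in chunk]


def chunked(iterable, size):
    # Yield successive lists of at most `size` items.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def parallel_match(lines, word_list, chunk_size=10_000, max_workers=4):
    # Submit one task per batch and collect results in input order.
    results = []
    with ProcessPoolExecutor(max_workers=max_workers) as pe:
        futures = [pe.submit(match_chunk, word_list, chunk)
                   for chunk in chunked(lines, chunk_size)]
        for f in futures:
            results.extend(f.result())
    return results
```

A call such as `parallel_match(get_lines(), my_word_list)` should be made under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. This sketch queues every batch up front, so for files that do not fit in memory the number of pending batches would need to be bounded. Whether it actually beats the 0.22-second single-process loop depends on how expensive the per-line work is, so it is worth measuring both.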
