Parallelize appending list items to a dict using multiprocessing



I have a large list containing strings. I want to create a dict from this list such that:

list = [str1, str2, str3, ....]

dict = {str1:len(str1), str2:len(str2), str3:len(str3),.....}

My solution is a for loop, but it is taking too much time (my list contains almost 1M elements):

d = {}
for i in list:
    d[i] = len(i)

I want to use the multiprocessing module in Python to make use of all cores and reduce the time the process takes to execute. I came across some crude examples involving the Manager module to share a dict between different processes, but I was unable to implement it. Any help would be appreciated!
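For reference, a Manager-based shared dict (the approach mentioned above) typically looks something like the sketch below. This is a minimal, illustrative example; the worker function, chunking and variable names are assumptions, not taken from the post:

import multiprocessing as mp

def fill(shared_d, chunk):
    # each worker writes its part of the result into the shared dict
    for s in chunk:
        shared_d[s] = len(s)

if __name__ == '__main__':
    data = ['apple', 'banana', 'cherry', 'date']  # stands in for the real list
    with mp.Manager() as manager:
        d = manager.dict()  # proxy dict shared across processes
        n = 2               # number of worker processes
        size = len(data) // n + 1
        procs = [mp.Process(target=fill, args=(d, data[i*size:(i+1)*size]))
                 for i in range(n)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        result = dict(d)  # copy out of the proxy before the manager shuts down
    print(result)

Note that every write to the managed dict goes through the manager process, so this can end up slower than building per-process dicts and merging them afterwards.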

I don't know whether using multiple processes will be faster, but it's an interesting experiment.

General flow:

  • Create a list of random words
  • Split the list into segments, one per process
  • Run the processes, passing each segment as an argument
  • Merge the resulting dictionaries into a single dictionary

Try this code:

import concurrent.futures
import random
from multiprocessing import freeze_support

def todict(lst):
    print(f'Processing {len(lst)} words')
    return {e: len(e) for e in lst}  # convert list to dictionary

if __name__ == '__main__':
    freeze_support()  # needed for Windows

    # create random word list - max 15 chars
    letters = [chr(x) for x in range(65, 65 + 26)]  # A-Z
    words = [''.join(random.sample(letters, random.randint(1, 15))) for w in range(10000)]  # 10000 words
    words = list(set(words))  # remove dups, count will drop
    print(len(words))

    ########################

    cpucnt = 4  # process count to use

    # split word list for each process
    wl = len(words) // cpucnt + 1  # word count per process
    lstsplit = []
    for c in range(cpucnt):
        lstsplit.append(words[c * wl:(c + 1) * wl])  # create word list for each process

    # start processes
    with concurrent.futures.ProcessPoolExecutor(max_workers=cpucnt) as executor:
        procs = [executor.submit(todict, lst) for lst in lstsplit]
        results = [p.result() for p in procs]  # block until results are gathered

    # merge results into a single dictionary
    dd = {}
    for r in results:
        dd.update(r)

    print(len(dd))  # confirm merged count matches word count
    with open('dd.txt', 'w') as f:
        f.write(str(dd))  # write dictionary to a text file
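As a possible follow-up (not part of the original answer), ProcessPoolExecutor.map expresses the same split-submit-merge pattern a little more compactly. This sketch assumes it continues inside the if __name__ == '__main__': block above and reuses todict, lstsplit and cpucnt:

    # alternative: map() returns the partial dictionaries in submission order
    with concurrent.futures.ProcessPoolExecutor(max_workers=cpucnt) as executor:
        dd2 = {}
        for partial in executor.map(todict, lstsplit):
            dd2.update(partial)  # merge each worker's dictionary
    print(len(dd2))  # same count as dd

Keep in mind that len() is so cheap that the cost of pickling the strings to and from the worker processes may outweigh the parallel speedup; it is worth timing both versions against the plain loop.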
