I have a large list of strings. I want to build a dict from this list such that:
list = [str1, str2, str3, ....]
dict = {str1:len(str1), str2:len(str2), str3:len(str3),.....}
My solution is a for loop, but it takes too much time (my list contains almost 1M elements):
d = {}
for i in list:
    d[i] = len(i)
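I would like to use Python's multiprocessing module to take advantage of all cores and reduce the execution time. I came across some rough examples that use a Manager to share a dict between processes, but I could not get them to work. Any help would be appreciated!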
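I don't know whether using multiple processes will be faster, but it's an interesting experiment.
General flow:
- Create a list of random words
- Split the list into segments, one segment per process
- Run the processes, passing a segment to each as its argument
- Merge the resulting dictionaries into a single dictionary
Try this code: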
import concurrent.futures
import random
from multiprocessing import freeze_support

def todict(lst):
    print(f'Processing {len(lst)} words')
    return {e: len(e) for e in lst}  # convert list to dictionary

if __name__ == '__main__':
    freeze_support()  # needed for Windows
    # create random word list - max 15 chars
    letters = [chr(x) for x in range(65, 65 + 26)]  # A-Z
    words = [''.join(random.sample(letters, random.randint(1, 15))) for _ in range(10000)]  # 10000 words
    words = list(set(words))  # remove dups, count will drop
    print(len(words))

    ########################

    cpucnt = 4  # process count to use

    # split word list for each process
    wl = len(words) // cpucnt + 1  # word count per process
    lstsplit = []
    for c in range(cpucnt):
        lstsplit.append(words[c * wl:(c + 1) * wl])  # create word list for each process

    # start processes
    with concurrent.futures.ProcessPoolExecutor(max_workers=cpucnt) as executor:
        procs = [executor.submit(todict, lst) for lst in lstsplit]
        results = [p.result() for p in procs]  # block until results are gathered

    # merge results to single dictionary
    dd = {}
    for r in results:
        dd.update(r)

    print(len(dd))  # confirm match word count
    with open('dd.txt', 'w') as f:
        f.write(str(dd))  # write dictionary to text file
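Since it is unclear whether multiple processes actually help here, a rough timing comparison may be useful. The sketch below is only an illustration: the helper names split_chunks, build_serial and build_parallel, the worker count of 4, and the generated test words are all assumptions, not part of the code above. It times a plain dict comprehension against the split/process/merge approach on a 1M-element list. Keep in mind that each chunk and each partial dict has to be pickled and sent between processes, and that cost may well outweigh the cheap len() call, so measure on your own data before committing to either approach.

```python
import time
from concurrent.futures import ProcessPoolExecutor


def todict(lst):
    """Map each string in lst to its length."""
    return {e: len(e) for e in lst}


def split_chunks(words, n):
    """Split words into at most n contiguous chunks (hypothetical helper)."""
    size = len(words) // n + 1
    return [words[i:i + size] for i in range(0, len(words), size)]


def build_serial(words):
    """Single-process baseline: one dict comprehension."""
    return todict(words)


def build_parallel(words, workers=4):
    """Convert each chunk in a separate process, then merge the partial dicts."""
    merged = {}
    with ProcessPoolExecutor(max_workers=workers) as executor:
        for part in executor.map(todict, split_chunks(words, workers)):
            merged.update(part)
    return merged


if __name__ == '__main__':
    # Made-up test data, roughly the size mentioned in the question.
    words = [f'word{i}' for i in range(1_000_000)]

    t0 = time.perf_counter()
    serial = build_serial(words)
    t1 = time.perf_counter()
    parallel = build_parallel(words)
    t2 = time.perf_counter()

    assert serial == parallel  # both approaches must give the same dict
    print(f'serial:   {t1 - t0:.3f} s')
    print(f'parallel: {t2 - t1:.3f} s')
```

Run it as a script, not from an interactive session, so that the worker processes can import todict.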