Parallel computation on a very large text file



I am trying to find and correct some typos in a very large text file. Basically, I run the following code:

import re

ocr = open("text.txt")
text = ocr.readlines()
clean_text = []
for line in text:
    last = re.sub(r"^(\|)([0-9])(\s)([A-Z][a-z]+[a-z]),", r"\1\2\t\3\4,", line)
    clean_text.append(last)
new_text = open("new_text.txt", "w", newline="\n")
for line in clean_text:
    new_text.write(line)
new_text.close()

In reality, I use re.sub more than 1,500 times, and text.txt has 100,000 lines. Can I split the text into several parts and use different cores on the different parts?

This applies the text-processing function (currently using the re.sub from the question) to NUM_CORES equally sized chunks of the input text file, and then writes them out, preserving the order of the original input file.

from multiprocessing import Pool, cpu_count
import numpy as np
import re

NUM_CORES = cpu_count()

def process_text(input_textlines):
    clean_text = []
    for line in input_textlines:
        cleaned = re.sub(r"^(\|)([0-9])(\s)([A-Z][a-z]+[a-z]),", r"\1\2\t\3\4,", line)
        clean_text.append(cleaned)
    return "".join(clean_text)

# read in data and split it into NUM_CORES equally sized chunks of lines
with open('data/text.txt', 'r') as f:
    lines = f.readlines()
num_lines = len(lines)
text_chunks = np.array_split(lines, NUM_CORES)

# process each chunk in parallel
pool = Pool(NUM_CORES)
results = pool.map(process_text, text_chunks)

# write out results, keeping the chunks in their original order
with open("new_text.txt", "w", newline="\n") as f:
    for text_chunk in results:
        f.write(text_chunk)
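
Since the question mentions applying re.sub more than 1,500 times, it may also help to compile the patterns once and loop over them inside the worker function, so each process reuses the compiled regexes for every line in its chunk. Below is a minimal sketch of that idea; the SUBSTITUTIONS list and its single entry are hypothetical placeholders standing in for the actual 1,500 patterns, which are not shown in the question.

import re

# Hypothetical placeholder for the ~1500 (pattern, replacement) pairs;
# compiling them once avoids re-parsing each pattern for every line.
SUBSTITUTIONS = [
    (re.compile(r"^(\|)([0-9])(\s)([A-Z][a-z]+[a-z]),"), r"\1\2\t\3\4,"),
    # ... the remaining patterns would be listed here ...
]

def process_text(input_textlines):
    # apply every compiled substitution to each line of this chunk
    clean_text = []
    for line in input_textlines:
        for pattern, replacement in SUBSTITUTIONS:
            line = pattern.sub(replacement, line)
        clean_text.append(line)
    return "".join(clean_text)

This version of process_text can be passed to pool.map exactly as in the code above.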
