How do I stop multiprocessing.Pool from consuming all my memory?

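My multiprocessing pool (8 cores, 16 GB RAM) uses up all of my memory before it has ingested much of the data. I am working on a 6 GB dataset.

I have tried the various pool methods, including imap, imap_unordered, apply, map, etc. I have also tried maxtasksperchild, which seems to increase memory usage.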

import string
import re
import multiprocessing as mp
from tqdm import tqdm

linkregex = re.compile(r"http\S+")
puncregex = re.compile(r"(?<=\w)[^\s\w](?![^\s\w])")
emojiregex = re.compile(r"(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])")
sentences = []

def process(item):
    return re.sub(emojiregex, r" \1 ", re.sub(puncregex, "", re.sub(linkregex, "link", item))).lower().split()

if __name__ == '__main__':
    with mp.Pool(8) as pool:
        sentences = list(tqdm(pool.imap_unordered(process, open('scrape/output.txt')), total=52123146))
    print(str(len(sentences)))
    with open("final/word2vectweets.txt", "a+") as out:
        out.write(sentences)

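This should return a list of the processed lines from the file, but it consumes memory too quickly. Earlier versions without mp and with simpler processing ran successfully.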

How does this look?

import re
import multiprocessing as mp

linkregex = re.compile(r"http\S+")
puncregex = re.compile(r"(?<=\w)[^\s\w](?![^\s\w])")
emojiregex = re.compile(r"(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff])")

# Paths taken from the question.
in_file_path = 'scrape/output.txt'
out_file_path = 'final/word2vectweets.txt'

def process(item):
    return re.sub(emojiregex, r" \1 ", re.sub(puncregex, "", re.sub(linkregex, "link", item))).lower().split()

if __name__ == '__main__':
    with mp.Pool() as pool, open(in_file_path, 'r') as file_in, open(out_file_path, 'a') as file_out:
        # Write each result as soon as it arrives instead of collecting them all in a list.
        for curr_sentence in pool.imap_unordered(process, file_in, chunksize=1000):
            file_out.write(f'{curr_sentence}\n')
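
One likely reason the version in the question runs out of memory is that list(...) keeps every processed line alive at once, and on CPython a list of token strings takes considerably more room than the raw text it came from. The snippet below is only a rough way to see that overhead for a single line; the sample line and the helper name tokens_footprint are made up for illustration, and the exact numbers depend on the Python build.

```python
import sys


def tokens_footprint(tokens):
    """Approximate bytes held by a list of str tokens: the list object plus each string."""
    return sys.getsizeof(tokens) + sum(sys.getsizeof(t) for t in tokens)


# Hypothetical sample line, standing in for one line of the input file.
line = "just setting up my twttr link\n"
tokens = line.lower().split()

raw_bytes = len(line.encode("utf-8"))   # size of the line as stored on disk
held_bytes = tokens_footprint(tokens)   # size of the tokenized result in memory

print(f"raw line: {raw_bytes} bytes, tokenized: {held_bytes} bytes "
      f"(about {held_bytes / raw_bytes:.1f}x)")
```

Multiplied across the 52,123,146 lines mentioned in the question, that per-line overhead adds up quickly, which is why writing each result out and dropping it, as the loop above does, should keep memory use much flatter.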

I tested a bunch of chunk sizes, and 1000 seems to be the sweet spot. I will keep investigating.
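
For anyone who wants to repeat the chunk-size comparison on their own data, here is a minimal timing sketch. It is not the exact test I ran: process is a simplified stand-in for the regex-based function above, and time_chunksize, sample_lines and the list of sizes are hypothetical names and values chosen for illustration. It only reads the first sample_lines lines so each run stays short, and the measured time includes pool start-up.

```python
import itertools
import multiprocessing as mp
import sys
import time


def process(line):
    # Simplified stand-in for the regex-based cleaning function above.
    return line.lower().split()


def time_chunksize(path, chunksize, sample_lines=200_000, workers=None):
    """Return the seconds taken to process the first `sample_lines` lines of `path`."""
    start = time.perf_counter()
    with mp.Pool(workers) as pool, open(path, "r") as file_in:
        sample = itertools.islice(file_in, sample_lines)
        for _ in pool.imap_unordered(process, sample, chunksize=chunksize):
            pass  # discard results; only the elapsed time matters here
    return time.perf_counter() - start


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python bench_chunksize.py scrape/output.txt
    for size in (1, 10, 100, 1000, 10000):
        elapsed = time_chunksize(sys.argv[1], size)
        print(f"chunksize={size:>6}: {elapsed:.2f} s")
```

The best value will probably differ with line length, core count and how expensive the per-line work is, so treat 1000 as a starting point, not a rule.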
