How to share data among all processes in Python multiprocessing



I want to search a given article for a predefined list of keywords and increment the score by 1 whenever a keyword is found in the article. I want to use multiprocessing because the predefined list of keywords is very large (10k keywords) and there are 100k articles.

I came across this question, but it does not solve my problem.

I tried this implementation, but the result is None.

import multiprocessing as mp

keywords = ["threading", "package", "parallelize"]

def search_worker(keyword):
    score = 0
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    if keyword in article:
        score += 1
    return score

I tried the following two approaches, but got three Nones as the result:

Approach 1:

pool = mp.Pool(processes=4)
result = [pool.apply(search_worker, args=(keyword,)) for keyword in keywords]

Approach 2:

result = pool.map(search_worker, keywords)
print(result)

Actual output: [None, None, None]

Expected output: 3

I want to send the predefined list of keywords and the article to the worker, but I am not sure whether I am heading in the right direction, as I have no prior experience with multiprocessing.

Thanks in advance.

Here is a function that uses Pool. You can pass in the text and the keyword_list and it will work. You could use Pool.starmap to pass (text, keyword) tuples instead, but then you would need to deal with an iterable holding 10k references to text.

from functools import partial
from multiprocessing import Pool


def search_worker(text, keyword):
    # 1 if the keyword occurs in the text, 0 otherwise
    return int(keyword in text)


def parallel_search_text(text, keyword_list):
    processes = 4
    chunk_size = 10
    total = 0
    # bind the text so each worker only needs to receive a keyword
    func = partial(search_worker, text)
    with Pool(processes=processes) as pool:
        for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
            total += result
    return total


if __name__ == '__main__':
    texts = []  # a list of texts
    keywords = []  # a list of keywords
    for text in texts:
        print(parallel_search_text(text, keywords))
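
As a quick sanity check, the function above can be called with the article and keyword list from the question; all three keywords occur in that text, so this should print 3:

if __name__ == '__main__':
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    keywords = ["threading", "package", "parallelize"]
    print(parallel_search_text(article, keywords))  # prints 3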

Creating the pool of worker processes incurs overhead, so it may be worth timing this against a simple single-process text-search function (a minimal baseline sketch is shown after the code below). Repeated calls can be sped up by creating one Pool instance and passing it into the function:

def parallel_search_text2(text, keyword_list, pool):
    chunk_size = 10
    results = 0
    func = partial(search_worker, text)
    for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
        results += result
    return results


if __name__ == '__main__':
    texts = []  # a list of texts
    keywords = []  # a list of keywords
    # create the pool once and reuse it across all calls
    with Pool(processes=4) as pool:
        for text in texts:
            print(parallel_search_text2(text, keywords, pool))
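
For the timing comparison mentioned above, a minimal single-process baseline (the name sequential_search_text is just an illustrative choice) could look like this:

def sequential_search_text(text, keyword_list):
    # plain loop over the keywords in a single process, for benchmarking against the Pool versions
    return sum(int(keyword in text) for keyword in keyword_list)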

User e.s addressed the main problem in their comment, but I'm posting a solution to Om Prakash's comment asking how to pass in:

the article and the predefined list of keywords to the worker method

This is fairly simple to do. All you need is to construct a tuple containing the arguments you want the worker to process:

from multiprocessing import Pool
def search_worker(article_and_keyword):
    # unpack the tuple
    article, keyword = article_and_keyword
    # score 1 if the keyword appears in the article
    score = 0
    if keyword in article:
        score += 1
    return score
if __name__ == "__main__":
    # the article and the keywords
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    keywords = ["threading", "package", "parallelize"]
    # construct the arguments for the search_worker; one keyword per worker but same article
    args = [(article, keyword) for keyword in keywords]
    # construct the pool and map to the workers
    with Pool(3) as pool:
        result = pool.map(search_worker, args)
    print(result)

If you're on a more recent version of Python (3.3+), I'd suggest trying starmap, as it makes this a little cleaner.
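
A rough sketch of what the starmap variant might look like, reusing the same article and keywords as above (the worker simply takes the two arguments directly instead of unpacking a tuple):

from multiprocessing import Pool

def search_worker(article, keyword):
    # starmap unpacks each (article, keyword) tuple into separate arguments
    return int(keyword in article)

if __name__ == "__main__":
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    keywords = ["threading", "package", "parallelize"]
    args = [(article, keyword) for keyword in keywords]
    with Pool(3) as pool:
        result = pool.starmap(search_worker, args)
    print(result)  # [1, 1, 1]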
