Python tqdm process_map: appending to a list shared between processes

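I want to share a list that collects the output of parallel workers started by process_map from tqdm. (I want to use process_map because of its nice progress indicator and its max_workers= option.)

I tried using from multiprocessing import Manager to create the shared list, but I am doing something wrong: my code prints an empty shared_list, whereas it should print a list of 20 numbers. The order of the numbers does not matter.

Any help would be greatly appreciated. Thanks in advance!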

import time
from tqdm.contrib.concurrent import process_map
from multiprocessing import Manager

shared_list = []

def worker(i):
    global shared_list
    time.sleep(1)
    shared_list.append(i)

if __name__ == '__main__':
    manager = Manager()
    shared_list = manager.list()
    process_map(worker, range(20), max_workers=5)
    print(shared_list)

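You did not say which platform you are running on (when you tag a question with multiprocessing, you should also tag it with your platform), but it appears that you are on a platform that uses spawn to create new processes, such as Windows. This means that when a new process is started, an empty address space is created, a new Python interpreter is launched, and the source file is re-executed from the top.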

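So although the block that begins with if __name__ == '__main__': assigns a managed list to shared_list, each process in the pool executes shared_list = [] when it re-runs the module, and that block is not executed in the child processes. Every worker therefore appends to its own private, ordinary list, and the managed list in the main process stays empty.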
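
To check which situation applies on a given machine, the start method can be queried from the standard library. The following is a minimal sketch; the helper name describe_start_method is hypothetical and is not part of the original answer.

```python
import multiprocessing as mp


def describe_start_method():
    """Return the start method in use and whether children inherit parent globals."""
    method = mp.get_start_method()
    # With "fork" the child starts as a copy of the parent's memory, so a global
    # assigned under the __main__ guard is visible in the workers. With the other
    # methods the child does not start from the parent's memory, so only
    # module-level code has run there.
    inherits_parent_globals = (method == "fork")
    return method, inherits_parent_globals


method, inherits = describe_start_method()
print(f"start method: {method}, children inherit parent globals: {inherits}")
```

If the second value is False, anything the workers need has to be passed to them explicitly, as in the approaches discussed in this answer.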

You can instead pass shared_list as the first argument to the worker function:

import time
from tqdm.contrib.concurrent import process_map
from multiprocessing import Manager
from functools import partial

def worker(shared_list, i):
    time.sleep(1)
    shared_list.append(i)

if __name__ == '__main__':
    manager = Manager()
    shared_list = manager.list()
    # partial binds the managed list as the first argument of every call
    process_map(partial(worker, shared_list), range(20), max_workers=5)
    print(shared_list)
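
For readers unfamiliar with functools.partial, here is a small in-process sketch of what the bound callable does. It assumes a plain Python list as a stand-in for the managed list and runs no worker processes; the names collected and bound are illustrative only.

```python
from functools import partial


def worker(shared_list, i):
    shared_list.append(i)


collected = []                      # plain list standing in for manager.list()
bound = partial(worker, collected)  # fixes the first positional argument

for i in range(3):
    bound(i)                        # equivalent to worker(collected, i)

print(collected)
```

In the multiprocessing version the bound object is sent to the worker processes together with the list proxy, so each call reaches the same managed list.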

If process_map supported the initializer and initargs arguments the way the ProcessPoolExecutor class does (it appears that it does not), you could have written:

import time
from tqdm.contrib.concurrent import process_map
from multiprocessing import Manager

def init_pool(the_list):
    # Runs once in each pool process and stores the list in a global
    global shared_list
    shared_list = the_list

def worker(i):
    time.sleep(1)
    shared_list.append(i)

if __name__ == '__main__':
    manager = Manager()
    shared_list = manager.list()
    # Hypothetical: this only works if process_map accepts initializer/initargs
    process_map(worker, range(20), max_workers=5, initializer=init_pool, initargs=(shared_list,))
    print(shared_list)
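
Since ProcessPoolExecutor itself does accept initializer and initargs (Python 3.7 and later), one option is to call it directly, at the cost of losing the progress bar that process_map provides. The following is a rough sketch under that assumption; the function name collect and the shortened sleep are illustrative choices, not part of the original answer.

```python
import time
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager


def init_pool(the_list):
    # Runs once in every worker process; keeps the list proxy in a global.
    global shared_list
    shared_list = the_list


def worker(i):
    time.sleep(0.1)
    shared_list.append(i)


def collect(n_tasks=20, max_workers=5):
    """Append 0..n_tasks-1 to a managed list from worker processes."""
    with Manager() as manager:
        managed = manager.list()
        with ProcessPoolExecutor(max_workers=max_workers,
                                 initializer=init_pool,
                                 initargs=(managed,)) as executor:
            # Consuming the iterator re-raises any exception from a worker.
            list(executor.map(worker, range(n_tasks)))
        # Copy into a regular list before the manager shuts down.
        return sorted(list(managed))


if __name__ == '__main__':
    print(collect())
```

The append order is not deterministic, which is why the sketch sorts the copy before returning it.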

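This is not strictly related to your original question, but for this type of problem you might want to reconsider the managed list. The worker function (coincidentally named worker) appends elements to it, and the order in which they are appended is not deterministic, because you have no control over how the pool processes are scheduled. As an alternative, consider a multiprocessing.Array instance initialized as follows: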

shared_list = Array('i', [0] * 20, lock=False)

Your worker function then becomes:

def worker(i):
    time.sleep(1)
    shared_list[i] = i

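Here the array is stored in shared memory, and access does not even need to be locked, because each invocation of worker touches a different index of the array. Accessing elements of a shared memory array is much faster than accessing elements of a managed list. The only catch is that a reference to shared memory cannot be passed as an argument, and we have seen that process_map does not support the initializer and initargs arguments. You will therefore have to use a lower-level approach. For example: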

import time
from multiprocessing import Pool, Array
from tqdm import tqdm

def init_pool(the_list):
    global shared_list
    shared_list = the_list

def worker(i):
    time.sleep(1)
    shared_list[i] = i

if __name__ == '__main__':
    # Preallocate 20 slots for the array in shared memory
    # And we don't require a lock if each worker invocation is accessing a different Array index:
    args = range(20)
    shared_list = Array('i', [0] * len(args), lock=False)
    with tqdm(total=len(args)) as pbar:
        pool = Pool(5, initializer=init_pool, initargs=(shared_list,))
        for result in pool.imap_unordered(worker, args):
            pbar.update(1)
    # print out elements one at a time:
    for elem in shared_list:
        print(elem)
    # print out all elements at once (must first convert to a regular list):
    print(list(shared_list))

Note 2

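I would avoid using process_map. It is built on the map method of ProcessPoolExecutor, which has to return results in the order of the elements of the iterable being passed, not in the order in which the tasks complete. Imagine what happens if, for some reason, the process handling the first submitted task (the one with i equal to 0 in this case) takes a long time and ends up being the last task to finish. You will see the tqdm progress bar do nothing for a long time, until that first task completes. By then all the other submitted tasks have already finished, so the progress bar jumps straight from 0 to 100%. Modify the worker function as follows to see this in action: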

def worker(shared_list, i):
    if i == 0:
        time.sleep(5)
    else:
        time.sleep(.25)
    shared_list.append(i)

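The version of the code I provided above that uses Pool.imap_unordered allows results to be returned out of order, and with the default chunksize of 1 it returns them in completion order. The progress bar will advance much more smoothly.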
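
The difference between ordered and completion-order delivery can be seen without tqdm or worker processes. This sketch assumes multiprocessing.pool.ThreadPool as a lightweight stand-in for Pool (it offers the same imap and imap_unordered methods) and uses a hypothetical worker, slow_first, whose first task is the slowest.

```python
import time
from multiprocessing.pool import ThreadPool


def slow_first(i):
    # Hypothetical worker: task 0 takes far longer than the others.
    time.sleep(0.5 if i == 0 else 0.05)
    return i


def arrival_orders(n_tasks=4):
    """Return results as delivered by imap and by imap_unordered."""
    with ThreadPool(n_tasks) as pool:
        # imap yields in submission order, so nothing arrives until task 0 is done.
        ordered = list(pool.imap(slow_first, range(n_tasks)))
        # imap_unordered yields each result as soon as its task finishes.
        by_completion = list(pool.imap_unordered(slow_first, range(n_tasks)))
    return ordered, by_completion


ordered, by_completion = arrival_orders()
print("imap:          ", ordered)
print("imap_unordered:", by_completion)
```

With imap the slow task comes first and holds everything else back, while with imap_unordered it is delivered last. A progress bar updated once per delivered result behaves accordingly.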

Note 3

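There also appears to be a bug in tqdm. The program below demonstrates it, this time using low-level tqdm calls together with the concurrent.futures module. Unfortunately, its ProcessPoolExecutor class (for multiprocessing) and ThreadPoolExecutor class (for multithreading) have no method equivalent to imap_unordered. You have to use the submit method (which is similar to the apply_async method of multiprocessing.pool.Pool). It returns a Future instance, on which you can call the result method to block until the submitted task completes and to get its result. You submit a batch of tasks, store the returned Future instances in a list, and then use the as_completed function to get the next completed Future instance from that list.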

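This demo uses threads. It creates a thread pool of size 20 and submits 20 tasks, so all tasks should start at the same time. The sleep time of worker1 varies so that the smaller the value of the i argument, the longer the sleep. The program creates a pool and submits the tasks four times. The first time, the return values are simply printed. The second time, a tqdm progress bar is used, and the results are what you would expect. The third time, worker2 is used with the tqdm progress bar. The difference is that the sleep time is constant (.25 seconds) for all values of i != 0, so the tasks for i = 1, 2, ... 19 should complete at roughly the same time. You would therefore expect the progress bar to jump to 95% very quickly and then wait for the i == 0 task to finish. What you actually observe is the opposite: the progress bar goes to 5%, hangs there for a long time, and then jumps to 100%. The fourth time uses my own progress "bar" (the MyProgressBar class), which behaves as you would expect.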

This is with tqdm 4.61.1 under Python 3.8.5. I have tested it on both Windows and Linux. Can anybody explain this behavior?

import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
import sys

class MyProgressBar:
    def __init__(self, n_tasks):
        self._task_count = n_tasks
        self._completed = 0
        self.update()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        print(file=sys.stderr)
        return False

    def update(self, count=0):
        self._completed += count
        # '\r' returns to the start of the line so the status is overwritten in place
        print(f'\r{self._completed} of {self._task_count} task(s) completed.', end='', flush=True)

def worker1(i):
    if i == 0:
        time.sleep(8)
    else:
        time.sleep(5 - i/5)
    return i

def worker2(i):
    if i == 0:
        time.sleep(8)
    else:
        time.sleep(.25)
    return i

if __name__ == '__main__':
    args = range(20)

    # Case 1: worker1, just print the return values
    with ThreadPoolExecutor(max_workers=20) as pool:
        futures = [pool.submit(worker1, arg) for arg in args]
        for future in as_completed(futures):
            print(future.result())

    # Case 2: worker1 with a tqdm progress bar
    with ThreadPoolExecutor(max_workers=20) as pool:
        with tqdm(total=len(args)) as pbar:
            futures = [pool.submit(worker1, arg) for arg in args]
            for future in as_completed(futures):
                future.result()
                pbar.update(1)

    # Case 3: worker2 with a tqdm progress bar
    with ThreadPoolExecutor(max_workers=20) as pool:
        with tqdm(total=len(args)) as pbar:
            futures = [pool.submit(worker2, arg) for arg in args]
            for future in as_completed(futures):
                future.result()
                pbar.update(1)

    # Case 4: works with my progress "bar":
    with ThreadPoolExecutor(max_workers=20) as pool:
        with MyProgressBar(len(args)) as pbar:
            futures = [pool.submit(worker2, arg) for arg in args]
            for future in as_completed(futures):
                future.result()
                pbar.update(1)
