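>I have some simple code that sets up and runs the processes:
with Pool(processes=4) as pool:
    pool.map(check_url, range(0, 240000))
This is needed to verify whether pages exist on a website, e.g. site.com/298 exists and site.com/17 does not, so I need to check 240,000 pages. The problem is that when I run the script, the values from range() are not processed in order, i.e. in the output I see: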
Page found: 26545
Page not found: 1523
Page found: 45
Page found: 9
Page found: 4568
Page not found: 256
....
I tried using a prepared list instead of range:
urls = [i for i in range(0, 240000)]
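When I print it, I see a list of numbers in order, but the processes still start out of order. How can I make the processes run in order?
UPD: Can my solution check the same page twice or more?
The whole point of Pool.map is to split up the tasks and have them executed separately (https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.map). If you want to feed the data in sequentially, you need to submit it sequentially, i.e.: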
import multiprocessing as mp
from time import sleep
import random

def f(x):
    worker_name = mp.current_process().name
    print(f"[{x}] by {worker_name}")  # start
    timetosleep = random.randrange(10) / 10
    sleep(timetosleep)
    print(f"-[{x}] by {worker_name}")  # done
    return x

if __name__ == '__main__':
    print("Init")
    with mp.Pool(processes=16) as p:
        for i in range(10):
            p.apply_async(f, (i,))
        p.close()
        p.join()
    print("Done")
This gives the output:
Init
[0] by SpawnPoolWorker-4
[1] by SpawnPoolWorker-2
[2] by SpawnPoolWorker-1
[3] by SpawnPoolWorker-3
[4] by SpawnPoolWorker-5
[5] by SpawnPoolWorker-6
[6] by SpawnPoolWorker-7
[7] by SpawnPoolWorker-8
[8] by SpawnPoolWorker-10
-[7] by SpawnPoolWorker-8
[9] by SpawnPoolWorker-8
-[5] by SpawnPoolWorker-6
-[2] by SpawnPoolWorker-1
-[0] by SpawnPoolWorker-4
-[9] by SpawnPoolWorker-8
-[4] by SpawnPoolWorker-5
-[8] by SpawnPoolWorker-10
-[6] by SpawnPoolWorker-7
-[3] by SpawnPoolWorker-3
-[1] by SpawnPoolWorker-2
Done
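As you can see, the processes are started in order, but each one takes a different amount of time to finish. If you need them to finish in order, this is not an option, because you cannot guarantee that.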
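If what matters is the order of the reported results rather than the order of execution, one option is to have the worker return its result instead of printing it. Pool.map returns a list in the same order as the input, with one entry per input item, so no page is checked twice. A minimal sketch, assuming a stand-in check_url that simulates the request (the even/odd rule and the helper check_pages are made up for illustration):

```python
import multiprocessing as mp
import random
import time


def check_url(page):
    """Hypothetical stand-in for the real request: even pages 'exist'."""
    time.sleep(random.random() / 100)  # simulate variable network latency
    return page, page % 2 == 0


def check_pages(pages, processes=4):
    """Check pages in parallel; the returned list follows the input order."""
    with mp.Pool(processes=processes) as pool:
        return pool.map(check_url, pages)


if __name__ == "__main__":
    # Printing happens in the parent, after the workers have finished,
    # so the lines appear in page order.
    for page, found in check_pages(range(10)):
        print(f"Page {'found' if found else 'not found'}: {page}")
```

The workers still finish in an unpredictable order; only the collected list is ordered.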
From the Python documentation on Pool, you can see the signature of 'map':
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only one iterable argument though). It blocks until the result is ready.
This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer.
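The chunking also helps explain why the printed page numbers jump around so much. The sketch below imitates the splitting in plain Python. The default_chunksize heuristic is my understanding of what CPython does when chunksize is omitted, so treat it as an assumption about an implementation detail rather than documented behavior. Both helper names are made up for illustration.

```python
from itertools import islice


def default_chunksize(n_items, n_workers):
    """Approximate the default chunk size (assumed CPython heuristic)."""
    chunksize, extra = divmod(n_items, n_workers * 4)
    return chunksize + 1 if extra else chunksize


def chunks(iterable, size):
    """Yield successive tuples of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = tuple(islice(it, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    size = default_chunksize(240000, 4)
    starts = [chunk[0] for chunk in chunks(range(240000), size)]
    print(size, starts[:4])  # 15000 [0, 15000, 30000, 45000]
```

Under that assumption, each of the 4 workers begins on a different block of 15,000 pages, so the first lines printed come from widely separated page numbers.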
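Your jobs are submitted in parallel, which means they are not guaranteed to execute in order. If you need to poll the site sequentially, parallelization may not be the best fit, and you could consider a for loop to guarantee sequential behavior.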
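A middle ground, if only the output needs to be in order, is Pool.imap, which yields results in input order as they become available while the workers still run in parallel. A rough sketch, again with a made-up check_url (multiples of 3 "exist") and a hypothetical helper iter_checked:

```python
import multiprocessing as mp


def check_url(page):
    """Hypothetical stand-in for the real request: multiples of 3 'exist'."""
    return page, page % 3 == 0


def iter_checked(pages, processes=4, chunksize=50):
    """Yield (page, found) in input order while workers run in parallel."""
    with mp.Pool(processes=processes) as pool:
        # imap preserves input order; chunksize batches items per task.
        yield from pool.imap(check_url, pages, chunksize=chunksize)


if __name__ == "__main__":
    for page, found in iter_checked(range(10)):
        print(f"Page {'found' if found else 'not found'}: {page}")
```

Unlike map, this does not wait for all 240,000 results before the first one is printed, though a slow page will hold back the output of the pages after it.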