Filling a queue and managing multiprocessing in Python



I'm having this problem in Python:

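  • I have a queue of URLs that I need to check from time to time
  • If the queue has been filled, I need to process each item in it
  • Each item in the queue must be processed by a single process (multiprocessing)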

So far I have managed to do this "manually", like this:

while 1:
        self.updateQueue()
        while not self.mainUrlQueue.empty():
            domain = self.mainUrlQueue.get()
            # if we haven't launched maxprocess processes yet, start a new one
            if len(self.jobs) < maxprocess:
                self.startJob(domain)
                #time.sleep(1)
            else:
                # If all process slots are taken, we need to clear a finished process from our pool and start a new one
                jobdone = 0
                # We cycle through the processes until we find one that has finished; only then do we leave the loop
                while jobdone == 0:
                    for p in self.jobs :
                        #print "entering loop"
                        # if the process finished
                        if not p.is_alive() and jobdone == 0:
                            #print str(p.pid) + " job dead, starting new one"
                            self.jobs.remove(p)
                            self.startJob(domain)
                            jobdone = 1

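However, this causes a lot of problems and errors. I am wondering whether I would be better off using a pool of processes. What would be the right way to do this?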

Also, much of the time my queue is empty, and then it can be filled with 300 items within a second, so I'm not sure how to handle that here.

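You could use the blocking behaviour of the queue: spawn several processes at startup (using multiprocessing.Pool) and let them sleep until there is data on the queue to process. If you're not familiar with this, you can experiment with this simple program: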

import multiprocessing
import os
import time
the_queue = multiprocessing.Queue()

def worker_main(queue):
    print os.getpid(),"working"
    while True:
        item = queue.get(True)
        print os.getpid(), "got", item
        time.sleep(1) # simulate a "long" operation
the_pool = multiprocessing.Pool(3, worker_main,(the_queue,))
#                           don't forget the comma here  ^
for i in range(5):
    the_queue.put("hello")
    the_queue.put("world")

time.sleep(10)

Tested with Python 2.7.3 on Linux (the print statements above use Python 2 syntax).

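This spawns 3 processes (in addition to the parent process). Each child executes the worker_main function, which is a simple loop that gets a new item from the queue on each iteration. A worker blocks when there is nothing ready to process.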

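At startup, all 3 processes sleep until the queue is fed with some data. When data becomes available, one of the waiting workers gets the item and starts processing it. After that, it tries to get another item from the queue and waits again if none is available...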
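To make the blocking behaviour concrete, here is a minimal sketch (Python 3) of a variant that waits for an item only up to a timeout instead of blocking forever. The helper name get_or_none and the timeout values are assumptions made for illustration; Queue.get(block, timeout) and the queue.Empty exception come from the standard library.

```python
import multiprocessing
import queue  # only needed for the queue.Empty exception


def get_or_none(q, timeout):
    """Wait up to `timeout` seconds for an item; return None if nothing arrived.

    Hypothetical helper, not part of the program above.
    """
    try:
        return q.get(block=True, timeout=timeout)
    except queue.Empty:
        return None


if __name__ == "__main__":
    q = multiprocessing.Queue()
    q.put("hello")
    print(get_or_none(q, 1.0))  # the item that was just put
    print(get_or_none(q, 0.1))  # nothing left, so None after about 0.1 s
```

If None can itself be a legitimate queue item, a distinct marker object would have to be returned instead.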

Edit: added some code (putting None on the queue) to shut down the worker processes cleanly, plus code to close and join the_queue and the_pool:

import multiprocessing
import os
import time
NUM_PROCESSES = 20
NUM_QUEUE_ITEMS = 20  # so really 40, because hello and world are processed separately

def worker_main(queue):
    print(os.getpid(),"working")
    while True:
        item = queue.get(block=True) #block=True means make a blocking call to wait for items in queue
        if item is None:
            break
        print(os.getpid(), "got", item)
        time.sleep(1) # simulate a "long" operation

def main():
    the_queue = multiprocessing.Queue()
    the_pool = multiprocessing.Pool(NUM_PROCESSES, worker_main,(the_queue,))
            
    for i in range(NUM_QUEUE_ITEMS):
        the_queue.put("hello")
        the_queue.put("world")
    
    for i in range(NUM_PROCESSES):
        the_queue.put(None)
    # this process will put nothing more on the queue; wait until the buffered items have been flushed to the pipe
    the_queue.close()
    the_queue.join_thread()
    # prevent adding anything more to the process pool and wait for all processes to finish
    the_pool.close()
    the_pool.join()
if __name__ == '__main__':
    main()
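
On the process-pool part of the question, an alternative sketch (Python 3) is to drain whatever URLs are currently queued and hand that batch to Pool.imap_unordered, so the pool does the bookkeeping that the manual loop did by hand. Here process_url, drain and the example.com URLs are hypothetical placeholders, and a plain queue.Queue stands in for the asker's mainUrlQueue.

```python
import multiprocessing
import queue


def process_url(url):
    """Placeholder for the real per-URL work (hypothetical)."""
    return url, len(url)


def drain(q):
    """Remove and return every item currently in the queue, without blocking."""
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            return items


def main():
    url_queue = queue.Queue()  # stand-in for the asker's mainUrlQueue
    for i in range(10):
        url_queue.put("http://example.com/page%d" % i)

    # At most 3 URLs are processed at once; the pool reuses its workers.
    with multiprocessing.Pool(processes=3) as pool:
        batch = drain(url_queue)
        # Results arrive in completion order, not submission order.
        for url, size in pool.imap_unordered(process_url, batch):
            print(url, size)


if __name__ == "__main__":
    main()
```

In the asker's loop, this drain-and-map step would presumably run after each updateQueue() call. When the queue is empty, drain returns an empty list and the pool has nothing to do, and a burst of 300 items is spread over the 3 workers.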
