我做了一个多处理爬虫。下面是我的代码的简单结构:
class abc:
def all(self):
return "This is abc n what a abc!n"
class bcd:
def all(self):
return "This is bcd n what a bcd!n"
class cde:
def all(self):
return "This is cde n what a cde!n"
class ijk:
def all(self):
return "This is ijk n what a ijk!n"
def crawler(sites, ps_queue):
for site in sites:
ps_queue.put(site.all())
messages = ''
def message_collector(ps_queue):
global messages
while True:
message = ps_queue.get()
messages += message
ps_queue.task_done()
def main():
ps_queue = mp.JoinableQueue()
message_collector_proc = mp.Process(
target=message_collector,
args=(ps_queue, )
)
message_collector_proc.daemon = True
message_collector_proc.start()
site_list = [abc(), bcd(), cde(), ijk(), abc()]
crawler(site_list, ps_queue)
ps_queue.join()
print(messages)
if __name__ == "__main__":
main()
有关这些代码的问题:1.在main()
的末尾,有一个代码print(messages)
但它没有打印出任何东西。为什么会这样?2.有些事情击中了我的头:消息可能会被搞砸,因为每个进程同时访问全局变量messages
。需要lock
吗?或者在这种情况下,每个进程按顺序访问messages
?
谢谢
当您调用mp.Process(target=message_collector, args=(ps_queue, ))
打开一个新进程时,其中创建了一个新的消息对象,它会正确更新,但是当您尝试在主进程中打印它时,它是空的,因为它不是您向其添加数据的同一消息对象。 您需要另一个同步对象,以便可以在进程之间共享messages
, 请参阅如何在 Python 中使用 Managers(( 在多个进程之间共享字符串?
下面是一个使用 Manager.Dict 对象的简单示例(不是最优雅或最有效的解决方案,但绝对是最简单的(
import multiprocessing as mp
class abc:
def all(self):
return "This is abc n what a abc!n"
class bcd:
def all(self):
return "This is bcd n what a bcd!n"
class cde:
def all(self):
return "This is cde n what a cde!n"
class ijk:
def all(self):
return "This is ijk n what a ijk!n"
manager = mp.Manager()
messages_dict = manager.dict()
def crawler(sites, ps_queue):
for site in sites:
ps_queue.put(site.all())
def message_collector(ps_queue):
messages = ''
while True:
message = ps_queue.get()
messages += message
messages_dict['value'] = messages
ps_queue.task_done()
def main():
ps_queue = mp.JoinableQueue()
message_collector_proc = mp.Process(
target=message_collector,
args=(ps_queue, )
)
message_collector_proc.daemon = True
message_collector_proc.start()
site_list = [abc(), bcd(), cde(), ijk(), abc()]
crawler(site_list, ps_queue)
ps_queue.join()
print(messages_dict['value'])