Web-scraping multiprocessing won't start on Windows and Mac



A few days ago I asked a question here about multiprocessing, and a user sent me the answer you can see below. The only problem is that the answer works on his machine but not on mine.

I have tried Windows (Python 3.6) and Mac (Python 3.8). I have run the code in the basic Python IDLE that comes with the installation, in PyCharm on Windows, and in Jupyter Notebook, but nothing happens. I have 32-bit Python. Here is the code:

from bs4 import BeautifulSoup
import requests
from datetime import date, timedelta
from multiprocessing import Pool
import tqdm

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

def parse(url):
    print("im in function")
    response = requests.get(url[4], headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    all_skier_names = soup.find_all("div", class_="g-xs-10 g-sm-9 g-md-4 g-lg-4 justify-left bold align-xs-top")
    all_countries = soup.find_all("span", class_="country__name-short")
    discipline = url[0]
    season = url[1]
    competition = url[2]
    gender = url[3]
    out = []
    for name, country in zip(all_skier_names, all_countries):
        skier_name = name.text.strip().title()
        country = country.text.strip()
        out.append([discipline, season, competition, gender, country, skier_name])
    return out

all_urls = [['Cross-Country', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=M&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=L&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=M&nationcode='],
            ['Cross-Country', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=L&nationcode=']]

with Pool(processes=2) as pool, tqdm.tqdm(total=len(all_urls)) as pbar:
    all_data = []
    print("im in pool")
    for data in pool.imap_unordered(parse, all_urls):
        print("im in data")
        all_data.extend(data)
        pbar.update()

print(all_data)

When I run the code, the only thing I see is the progress bar, which always stays at 0%:

0%|          | 0/8 [00:00<?, ?it/s]

I put several print statements in the parse(url) function and in the for loop at the end of the code, but still, the only thing that gets printed is "im in pool". It is as if the code never enters the function at all, and it never reaches the for loop at the end.

The code should execute in 5-8 seconds, but I waited 10 minutes and nothing happened. I also tried running it without the progress bar, but the result was the same.

Do you know where the problem is? Is it the Python version I'm using (Python 3.6, 32-bit), or the version of some library? I don't know what to do...
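For what it's worth, this exact symptom (the pool starts but workers never run the function) is what happens when multiprocessing code runs at module level without an `if __name__ == "__main__":` guard. On Windows, and on macOS since Python 3.8, multiprocessing uses the spawn start method, which re-imports the main script in each worker; Jupyter and IDLE have the same problem because worker processes cannot find functions defined in the interactive session. A minimal sketch of the guarded structure (the `parse` body here is a placeholder, not the scraping code above):

```python
from multiprocessing import Pool

def parse(item):
    # placeholder for the real scraping work
    return item * 2

if __name__ == "__main__":
    # Pool creation must live under this guard on spawn-based platforms
    # (Windows, and macOS on Python 3.8+): each worker re-imports this
    # module, and without the guard it would try to create its own Pool.
    with Pool(processes=2) as pool:
        results = pool.map(parse, [1, 2, 3])
    print(results)  # [2, 4, 6]
```

Running the guarded script as a plain `.py` file from the command line (rather than from IDLE or a notebook) is the usual way to rule this out.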

A better option is multithreading, which Python provides through the threading module:

import logging
import threading

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    threads = list()
    # scraper_list, scraper_checker, error_file and success_file are
    # defined elsewhere in the surrounding script
    for scraper in scraper_list:
        logging.info("Main    : create and start thread %s.", scraper)
        x = threading.Thread(target=scraper_checker, args=(scraper,))
        threads.append(x)
        x.start()
    for index, thread in enumerate(threads):
        thread.join()
        logging.info("Main    : thread %d done", index)
    error_file.close()
    success_file.close()

    print("Done!")
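If threading is the route taken, the standard library's concurrent.futures gives a more compact version of the same pattern, and it works fine inside Jupyter and IDLE because no new processes are spawned. This sketch uses a hypothetical `parse` stand-in rather than the real requests/BeautifulSoup code:

```python
from concurrent.futures import ThreadPoolExecutor

def parse(url):
    # hypothetical stand-in for the network-bound scraping work;
    # threads suit this job because the time is spent waiting on I/O
    return "parsed " + url

urls = ["https://example.com/a", "https://example.com/b"]

# executor.map returns results in input order, like the built-in map
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(parse, urls))

print(results)
```

Unlike the multiprocessing version, this needs no `__main__` guard, since the worker threads share the interpreter that already has `parse` defined.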
