How to implement multiprocessing in my BeautifulSoup web scraper



I made a web scraper with Python and the BeautifulSoup library, and it works well; the only problem is that it is very slow. So now I would like to implement some multiprocessing so I can speed it up, but I don't know how.

My code consists of two parts. The first part scrapes the website to generate the URLs I want to scrape further and appends those URLs to a list. The first part looks like this:

from bs4 import BeautifulSoup
import requests
from datetime import date, timedelta
from multiprocessing import Pool
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
links = [["Cross-Country", "https://www.fis-ski.com/DB/cross-country/cup-standings.html", "?sectorcode=CC&seasoncode={}&cupcode={}&disciplinecode=ALL&gendercode={}&nationcode="],
["Ski Jumping", "https://www.fis-ski.com/DB/ski-jumping/cup-standings.html", ""],
["Nordic Combined", "https://www.fis-ski.com/DB/nordic-combined/cup-standings.html", ""],
["Alpine", "https://www.fis-ski.com/DB/alpine-skiing/cup-standings.html", ""]]
# FOR LOOP FOR GENERATING URLS FOR SCRAPING
all_urls = []
for link in links[:1]:

    response = requests.get(link[1], headers = headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    discipline = link[0]
    print(discipline)
    season_list = []
    competition_list = []
    gender_list = ["M", "L"]

    all_seasons = soup.find_all("div", class_ = "select select_size_medium")[0].find_all("option")
    for season in all_seasons:
        season_list.append(season.text)

    all_competitions = soup.find_all("div", class_ = "select select_size_medium")[1].find_all("option")
    for competition in all_competitions:
        competition_list.append([competition["value"], competition.text])

    for gender in gender_list:
        for competition in competition_list[:1]:
            for season in season_list[:2]:
                url = link[1] + link[2].format(season, competition[0], gender)
                all_urls.append([discipline, season, competition[1], gender, url])

                print(discipline, season, competition[1], gender, url)
                print()

print(len(all_urls))

This first part generates over 4500 links, but I added some index limits so that it only generates 8 links for testing. Here is the second part of the code: it is a function, basically a for loop, that goes URL by URL and scrapes specific data. The second part:

# FUNCTION FOR SCRAPING
def parse():
    for url in all_urls:
        response = requests.get(url[4], headers = headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        all_skier_names = soup.find_all("div", class_ = "g-xs-10 g-sm-9 g-md-4 g-lg-4 justify-left bold align-xs-top")
        all_countries = soup.find_all("span", class_ = "country__name-short")

        discipline = url[0]
        season = url[1]
        competition = url[2]
        gender = url[3]

        for name, country in zip(all_skier_names, all_countries):
            skier_name = name.text.strip().title()
            country = country.text.strip()

            print(discipline, "|", season, "|", competition, "|", gender, "|", country, "|", skier_name)
            print()

parse()

I have read some documentation, and my multiprocessing part is supposed to look something like this:

p = Pool(10)  # Pool tells how many at a time
records = p.map(parse, all_urls)
p.terminate()
p.join()  

But when I ran this, I waited 30 minutes and nothing happened. What am I doing wrong, and how can I implement multiprocessing with a pool so that I can scrape 10 or more URLs at the same time?

Here is a simple implementation using multiprocessing.Pool. Note that parse() now takes a single url argument so the pool can feed URLs to it (in your version parse() takes no parameters and loops over all_urls itself, so Pool.map has nothing to pass to it). I also use the tqdm module to show a nice progress bar (it is useful for seeing the current progress of a long-running program):

from bs4 import BeautifulSoup
import requests
from datetime import date, timedelta
from multiprocessing import Pool
import tqdm
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
def parse(url):
    response = requests.get(url[4], headers = headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    all_skier_names = soup.find_all("div", class_ = "g-xs-10 g-sm-9 g-md-4 g-lg-4 justify-left bold align-xs-top")
    all_countries = soup.find_all("span", class_ = "country__name-short")
    discipline = url[0]
    season = url[1]
    competition = url[2]
    gender = url[3]
    out = []
    for name, country in zip(all_skier_names, all_countries):
        skier_name = name.text.strip().title()
        country = country.text.strip()
        out.append([discipline, season, competition, gender, country, skier_name])
    return out
# here I hard-coded all_urls:
all_urls = [['Cross-Country', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=M&nationcode='], ['Cross-Country', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/cross-country/cup-standings.html?sectorcode=CC&seasoncode=2020&cupcode=WC&disciplinecode=ALL&gendercode=L&nationcode='], ['Ski Jumping', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/ski-jumping/cup-standings.html'], ['Ski Jumping', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/ski-jumping/cup-standings.html'], ['Nordic Combined', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/nordic-combined/cup-standings.html'], ['Nordic Combined', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/nordic-combined/cup-standings.html'], ['Alpine', '2020', 'World Cup', 'M', 'https://www.fis-ski.com/DB/alpine-skiing/cup-standings.html'], ['Alpine', '2020', 'World Cup', 'L', 'https://www.fis-ski.com/DB/alpine-skiing/cup-standings.html']]
with Pool(processes=2) as pool, tqdm.tqdm(total=len(all_urls)) as pbar:  # create a Pool of processes (only 2 in this example) and a tqdm progress bar
    all_data = []                                                        # the rows returned from the parse() function are collected in this list
    for data in pool.imap_unordered(parse, all_urls):                    # send urls from all_urls to parse() (run concurrently in the process pool); results are yielded unordered, as soon as each worker finishes, without waiting for the others
        all_data.extend(data)                                            # update all_data list
        pbar.update()                                                    # update progress bar

# Note:
# this for-loop will have 8 iterations (because all_urls has 8 links)
# print(all_data)  # <-- this is your data
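
Since most of the time here is spent waiting on network responses rather than on CPU work, a thread pool is another option that avoids the process-startup and pickling overhead of multiprocessing. This is just a minimal sketch of that alternative, assuming the same parse() function and all_urls list defined above:

from concurrent.futures import ThreadPoolExecutor

# run parse() for every url in a pool of 10 worker threads;
# executor.map yields the results in the same order as all_urls
with ThreadPoolExecutor(max_workers=10) as executor:
    all_data = []
    for data in executor.map(parse, all_urls):  # each call returns the list built by parse()
        all_data.extend(data)                   # flatten the per-url lists into one list

# print(all_data)  # <-- same structure as with the process pool

For a scraper that mostly waits on requests.get, threads usually give a similar speedup; a process pool mainly pays off once the parsing itself becomes CPU-heavy.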

The code posted by @andrej-kesely works fine in IDLE. Make sure the code has the proper indentation where it should.
