Problem solved. The answer is in this tutorial.
I have been running a Scrapy script that crawls and scrapes. Everything works fine, but at some point during the run it always gets stuck. Here is what it shows:
[scrapy.extensions.logstats] INFO: Crawled 1795 pages (at 0 pages/min), scraped 1716 items (at 0 items/min)
I then stop the run with Ctrl+Z and restart the spider. After crawling and scraping some more data, it gets stuck again. Have you run into this problem before? How did you get past it? Here is the link to the full code.
Below is the spider's code:

import scrapy
from scrapy.loader import ItemLoader
from healthgrades.items import HealthgradesItem
from scrapy_playwright.page import PageMethod
# make the header elements like they are in a dictionary
def get_headers(s, sep=': ', strip_cookie=True, strip_cl=True, strip_headers: list = []) -> dict:
    d = dict()
    for kv in s.split('\n'):
        kv = kv.strip()
        if kv and sep in kv:
            v = ''
            k = kv.split(sep)[0]
            if len(kv.split(sep)) == 1:
                v = ''
            else:
                v = kv.split(sep)[1]
            if v == "''":
                v = ''
            if strip_cookie and k.lower() == 'cookie': continue
            if strip_cl and k.lower() == 'content-length': continue
            if k in strip_headers: continue
            d[k] = v
    return d
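# Illustrative example (not part of the original script) of what the helper returns:
# get_headers('accept: */*\nuser-agent: test')  ->  {'accept': '*/*', 'user-agent': 'test'}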
# spider class
class DoctorSpider(scrapy.Spider):
    name = 'doctor'
    allowed_domains = ['healthgrades.com']
    url = 'https://www.healthgrades.com/usearch?what=Massage%20Therapy&entityCode=PS444&where=New%20York&pageNum={}&sort.provider=bestmatch&='

    # change the bot's headers so it looks like a browser
    def start_requests(self):
        h = get_headers(
            '''
            accept: */*
            accept-encoding: gzip, deflate, br
            accept-language: en-US,en;q=0.9
            dnt: 1
            origin: https://www.healthgrades.com
            referer: https://www.healthgrades.com/
            sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
            sec-ch-ua-mobile: ?0
            sec-ch-ua-platform: "Windows"
            sec-fetch-dest: empty
            sec-fetch-mode: cors
            sec-fetch-site: cross-site
            user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
            '''
        )
        for i in range(1, 6):  # change the range to cover the desired page numbers; could be improved further
            # GET request to each results page
            yield scrapy.Request(self.url.format(i), headers=h, meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[PageMethod('wait_for_selector', 'h3.card-name a')]  # wait for a particular element to load
            ))
    def parse(self, response):
        for link in response.css('div h3.card-name a::attr(href)'):  # individual doctor's link
            yield response.follow(link.get(), callback=self.parse_categories)  # enter the doctor's page

    def parse_categories(self, response):
        l = ItemLoader(item=HealthgradesItem(), selector=response)
        l.add_xpath('name', '//*[@id="summary-section"]/div[1]/div[2]/div/div/div[1]/div[1]/h1')
        l.add_xpath('specialty', '//*[@id="summary-section"]/div[1]/div[2]/div/div/div[1]/div[1]/div[2]/p/span[1]')
        l.add_xpath('practice_name', '//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/p')
        l.add_xpath('address', 'string(//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address)')
        yield l.load_item()
The problem was the limit imposed by the concurrency settings.
Solution
Concurrent Requests
Adding concurrency to Scrapy is actually a very simple task. There is already a setting for the number of concurrent requests allowed; you only need to modify it.
You can choose to modify it in the custom settings of the spider you wrote, or in the global settings, which affect all spiders.
Global
To add it globally, simply add the following line to your settings file.
CONCURRENT_REQUESTS = 30
Here we set the number of concurrent requests to 30. Within reason, you can use any value you want.
Local
To add the setting locally, we have to add the concurrent-requests setting to the Scrapy spider through custom_settings.
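For reference, Scrapy's default value for CONCURRENT_REQUESTS is 16. A minimal sketch of where the line goes, assuming the project's settings module is healthgrades/settings.py (the path is inferred from the healthgrades.items import in the spider above):

# healthgrades/settings.py  (module path assumed from the import above)
BOT_NAME = 'healthgrades'
CONCURRENT_REQUESTS = 30  # Scrapy's default is 16; raise it within reason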
custom_settings = { 'CONCURRENT_REQUESTS': 30 }
Additional settings
There are many additional settings you can use instead of, or together with, CONCURRENT_REQUESTS.
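As a minimal sketch, this dictionary sits as a class attribute on the spider from the question; only the custom_settings attribute is new here, everything else stays as shown above:

class DoctorSpider(scrapy.Spider):
    name = 'doctor'
    allowed_domains = ['healthgrades.com']
    # per-spider override; takes precedence over the project-wide settings.py value
    custom_settings = {
        'CONCURRENT_REQUESTS': 30,
    }
    # ... start_requests / parse / parse_categories unchanged ...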
CONCURRENT_REQUESTS_PER_IP - sets the number of concurrent requests per IP address.
CONCURRENT_REQUESTS_PER_DOMAIN - defines the number of concurrent requests allowed for each domain.
MAX_CONCURRENT_REQUESTS_PER_DOMAIN - sets a maximum limit on the concurrent requests allowed for a domain.
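A small settings.py sketch combining the two of these that are standard Scrapy settings (values are illustrative, not recommendations):

# settings.py (illustrative values)
CONCURRENT_REQUESTS = 30            # overall cap across all in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap for any single target domain (Scrapy's default is 8)
CONCURRENT_REQUESTS_PER_IP = 0      # 0 disables the per-IP cap; a non-zero value is applied per IP and overrides the per-domain cap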