Scraping a website with Scrapy and Selenium together



The biggest challenge for me is scraping multiple pages with Selenium and Scrapy together. I have searched many questions about how to scrape multiple pages with Selenium and Scrapy, but I could not find any solution. The problem I am facing is that the spider only ever scrapes one page.

Using Selenium alone I can scrape multiple pages, and that works for me, but Selenium is not fast enough; plain Scrapy requests are much faster than driving a browser. This is the page link: https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx

import scrapy
from scrapy import Request
from selenium import webdriver

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def __init__(self):
        self.driver = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver.exe')

    def parse(self, response):
        for k in range(1, 10):
            books = response.xpath("//div[@class='list-group']//@href").extract()
            for book in books:
                url = response.urljoin(book)
                if url.endswith('.ro') or url.endswith('.ro/'):
                    continue
                yield Request(url, callback=self.parse_book)

            # Clicking "next" advances the Selenium browser, but `response`
            # still holds the first page, so only one page gets parsed.
            next_page = self.driver.find_element_by_xpath("//a[@id='MainContent_PagerTop_NavNext']")
            next_page.click()

    def parse_book(self, response):
        title = response.xpath("//span[@id='HeadingContent_lblTitle']//text()").get()
        d1 = response.xpath("//div[@class='col-md-10']//p[1]//text()").get().strip()
        d2 = response.xpath("//div[@class='col-md-10']//p[2]//text()").get().strip()
        d3 = response.xpath("//div[@class='col-md-10']//p[3]//span//text()").get().strip()
        d4 = response.xpath("//div[@class='col-md-10']//p[4]//text()").get().strip()

        yield {
            "title1": title,
            "title2": d1,
            "title3": d2,
            "title4": d3,
            "title5": d4,
        }
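The core of the issue above is that `next_page.click()` advances the Selenium browser while `response` stays fixed on the first page. One way to keep them in sync is to re-read the browser's rendered HTML after every click and wrap it in a Scrapy response object. A minimal sketch, assuming chromedriver is on PATH; the spider name is hypothetical and the selectors are reused from the question, so treat it as illustrative rather than a tested fix:

import scrapy
from scrapy import Request
from scrapy.http import HtmlResponse
from selenium import webdriver

class PagedTestSpider(scrapy.Spider):
    name = 'test_paged'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()  # assumes chromedriver on PATH

    def parse(self, response):
        # Load the page in the browser once, then re-read the rendered
        # HTML after every click instead of reusing the original response.
        self.driver.get(response.url)
        for _ in range(9):
            page = HtmlResponse(url=self.driver.current_url,
                                body=self.driver.page_source,
                                encoding='utf-8')
            for href in page.xpath("//div[@class='list-group']//@href").extract():
                url = page.urljoin(href)
                if not url.endswith(('.ro', '.ro/')):
                    yield Request(url, callback=self.parse_book)
            # Advance the browser; an explicit wait may be needed here so
            # that page_source reflects the new page on the next iteration.
            self.driver.find_element_by_xpath(
                "//a[@id='MainContent_PagerTop_NavNext']").click()

    def parse_book(self, response):
        yield {"title": response.xpath("//span[@id='HeadingContent_lblTitle']//text()").get()}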

You would be better off using or creating a downloader middleware for your Scrapy project. You can find everything about Scrapy downloader middlewares in the documentation: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
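As a rough illustration of that approach, here is a minimal hand-rolled Selenium downloader middleware that fetches each request with a real browser and hands the rendered HTML back to Scrapy. The class name, module path, and priority number are placeholders, and driver cleanup is omitted for brevity:

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        # Fetch with the real browser so JavaScript runs, then return the
        # rendered HTML to Scrapy; returning a Response from process_request
        # short-circuits the normal download handler.
        self.driver.get(request.url)
        return HtmlResponse(url=self.driver.current_url,
                            body=self.driver.page_source,
                            encoding='utf-8',
                            request=request)

# Enabled in settings.py, e.g.:
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SeleniumMiddleware': 543}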

I recommend using a ready-made library such as scrapy-selenium-middleware:
  1. Install the library: pip install scrapy-selenium-middleware
  2. Add the following settings to your Scrapy project's settings file:

DOWNLOADER_MIDDLEWARES = {"scrapy_selenium_middleware.SeleniumDownloader":451}
CONCURRENT_REQUESTS = 1 # multiple concurrent browsers are not supported yet
SELENIUM_IS_HEADLESS = False
SELENIUM_PROXY = "http://user:password@my-proxy-server:port" # set to None to not use a proxy
SELENIUM_USER_AGENT = "User-Agent: Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>"           
SELENIUM_REQUEST_RECORD_SCOPE = ["api*"] # a list of regular expressions; incoming requests are recorded when their URL matches one of them
SELENIUM_FIREFOX_PROFILE_SETTINGS = {}
SELENIUM_PAGE_LOAD_TIMEOUT = 120

You can find more information about this library here: https://github.com/Tal-Leibman/scrapy-selenium-middleware
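With the middleware enabled in settings, the spider itself can stay plain Scrapy, since every request is fetched through the managed browser. A small sketch under that assumption; the spider name is hypothetical and the selectors are borrowed from the question, not from the library's docs:

import scrapy

class LawyersSpider(scrapy.Spider):
    name = 'lawyers'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']

    def parse(self, response):
        # The middleware has already rendered the page in the browser,
        # so the response body contains the JavaScript-generated HTML.
        for href in response.xpath("//div[@class='list-group']//@href").extract():
            yield response.follow(href, callback=self.parse_book)

    def parse_book(self, response):
        yield {"title": response.xpath("//span[@id='HeadingContent_lblTitle']//text()").get()}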
