Scrapy + Selenium requests each URL twice


import scrapy
from selenium import webdriver

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                # locating the link inside the try block lets the loop
                # end cleanly when the last page has no "next" link
                next_page = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                next_page.click()
                # get the data and write it to scrapy items
            except Exception:
                break

        self.driver.close()

Scraping dynamic pages with Selenium and Scrapy

This solution works, but it requests the same URL twice: once by the Scrapy scheduler and again by the Selenium web driver.

That makes the job take twice as long as a Scrapy request without Selenium. How can this be avoided?

Here is a trick that solves the problem.

Create a web service for Selenium and run it locally

from flask import Flask, request, make_response
from flask_restful import Resource, Api
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)
api = Api(app)

class Selenium(Resource):
    _driver = None

    @staticmethod
    def getDriver():
        # lazily start one headless browser and reuse it for every request
        if not Selenium._driver:
            chrome_options = Options()
            chrome_options.add_argument("--headless")
            Selenium._driver = webdriver.Chrome(options=chrome_options)
        return Selenium._driver

    @property
    def driver(self):
        return Selenium.getDriver()

    def get(self):
        url = str(request.args['url'])
        self.driver.get(url)
        return make_response(self.driver.page_source)

api.add_resource(Selenium, '/')

if __name__ == '__main__':
    app.run(debug=True)
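The service keeps a single WebDriver alive across requests through the lazy `getDriver()` singleton, so the browser starts only once no matter how many URLs are fetched. The same pattern in isolation, with a plain object standing in for `webdriver.Chrome()` so the sketch runs without a browser:

```python
class LazyDriver:
    """Sketch of the lazy-singleton pattern used by getDriver() above;
    a plain object() stands in for the real WebDriver instance."""
    _driver = None
    launches = 0  # counts how often the "browser" is started

    @classmethod
    def get(cls):
        if cls._driver is None:
            cls.launches += 1        # here the service would start Chrome
            cls._driver = object()   # stand-in for webdriver.Chrome(...)
        return cls._driver

# Every caller gets the same instance; only one launch ever happens.
a, b = LazyDriver.get(), LazyDriver.get()
print(a is b, LazyDriver.launches)  # → True 1
```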

Now http://127.0.0.1:5000/?url=https://stackoverflow.com/users/5939254/yash-pokar will return the rendered web page using the Selenium Chrome/Firefox driver.

Now this is what our spider looks like:

import scrapy
from urllib.parse import quote  # Python 3; in Python 2 use urllib.quote

class ProductSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['ebay.com']
    urls = [
        'http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40',
    ]

    def start_requests(self):
        for url in self.urls:
            # route each request through the local Selenium service
            url = 'http://127.0.0.1:5000/?url={}'.format(quote(url))
            yield scrapy.Request(url)

    def parse(self, response):
        yield {
            'field': response.xpath('//td[@class="pagn-next"]/a'),
        }
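The key line in `start_requests` is the URL rewrite. A small helper (hypothetical name `via_service`) makes the encoding easy to check offline; `quote` percent-escapes the target URL so it survives intact as a single `url` query parameter. This sketch uses `safe=''` so that `/`, `?`, and `=` are all escaped:

```python
from urllib.parse import quote

SERVICE = 'http://127.0.0.1:5000/'  # the local Selenium service above

def via_service(url):
    """Rewrite a target URL so Scrapy fetches it through the service."""
    return '{}?url={}'.format(SERVICE, quote(url, safe=''))

print(via_service('http://www.ebay.com/sch/i.html?_nkw=python'))
# → http://127.0.0.1:5000/?url=http%3A%2F%2Fwww.ebay.com%2Fsch%2Fi.html%3F_nkw%3Dpython
```

Flask decodes the parameter back automatically, so `request.args['url']` in the service sees the original target URL.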
