How to run Selenium-Scrapy in parallel



I'm trying to scrape a JavaScript website with Scrapy and Selenium. I open the JavaScript site with Selenium and a Chrome driver, use Scrapy to scrape all of the links to the different listings from the current page, and store them in a list (this has been the best approach so far, since trying to follow links with SeleniumRequest and a callback to a parse-new-page function produced lots of errors). I then loop through the list of URLs, open each one in the Selenium driver, and scrape the information from the page. So far this scrapes 16 pages per minute, which is not ideal given the number of listings on this site. Ideally I would have the Selenium driver open links in parallel, as in the following implementations:

How do I make Selenium run in parallel with Scrapy?

https://gist.github.com/miraculixx/2f9549b79b451b522dde292c4a44177b

However, I don't know how to implement parallel processing in my Selenium-Scrapy code.

import scrapy
import time
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


class MarketPagSpider(scrapy.Spider):
    name = 'marketPagination'

    responses = []  # collected listing URLs

    def start_requests(self):
        yield SeleniumRequest(
            url="https://www.cryptoslam.io/nba-top-shot/marketplace",
            wait_time=5,
            wait_until=EC.presence_of_element_located((By.XPATH, '//SELECT[@name="table_length"]')),
            callback=self.parse
        )

    def parse(self, response):
        # initialize driver
        driver = response.meta['driver']
        driver.set_window_size(1920, 1080)
        time.sleep(1)
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, "(//th[@class='nowrap sorting'])[1]"))
        )

        # was response_obj.xpath(...), but response_obj is not defined until the loop below
        rows = response.xpath("//tbody/tr[@role='row']")
        for row in rows:
            link = row.xpath(".//td[4]/a/@href").get()
            absolute_url = response.urljoin(link)
            self.responses.append(absolute_url)

        # sequential bottleneck: each listing page is fetched one at a time
        for resp in self.responses:
            driver.get(resp)
            html = driver.page_source
            response_obj = Selector(text=html)
            yield {
                'name': response_obj.xpath("//div[@class='ibox-content animated fadeIn fetchable-content js-attributes-wrapper']/h4[4]/span/a/text()").get(),
                'price': response_obj.xpath("//span[@class='js-auction-current-price']/text()").get()
            }

I know that scrapy-splash can handle multiprocessing, but the website I'm trying to scrape doesn't open in Splash (at least I don't think it does).

I've also removed the pagination code to keep things concise here.

I'm quite new to this and open to any suggestions or solutions for multiprocessing with Selenium.

The following sample program creates a thread pool with just 2 threads for demonstration purposes and then scrapes 4 URLs to get their titles:

from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver
import threading
import gc


class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        # suppress logging:
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)
        print('The driver was just created.')

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        print('The driver has terminated.')


threadLocal = threading.local()

def create_driver():
    # one Driver instance per pool thread, created lazily and reused across URLs
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver

def get_title(url):
    driver = create_driver()
    driver.get(url)
    source = BeautifulSoup(driver.page_source, "lxml")
    title = source.select_one("title").text
    print(f"{url}: '{title}'")

# just 2 threads in our pool for demo purposes:
with ThreadPool(2) as pool:
    urls = [
        'https://www.google.com',
        'https://www.microsoft.com',
        'https://www.ibm.com',
        'https://www.yahoo.com'
    ]
    pool.map(get_title, urls)
    # must be done before terminate is explicitly or implicitly called on the pool:
    del threadLocal
    gc.collect()
# pool.terminate() is called at exit of with block

Prints:

The driver was just created.
The driver was just created.
https://www.google.com: 'Google'
https://www.microsoft.com: 'Microsoft - Official Home Page'
https://www.ibm.com: 'IBM - United States'
https://www.yahoo.com: 'Yahoo'
The driver has terminated.
The driver has terminated.
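
For what it's worth, the same thread-pool pattern can be grafted onto the spider from the question. The sketch below is one way to do it, assuming the Driver class, threadLocal, and create_driver() from the answer above are defined in the same module; the scrape_listing helper and the pool size of 4 are illustrative, not part of either original snippet.

import scrapy
from multiprocessing.pool import ThreadPool
from scrapy.selector import Selector

def scrape_listing(url):
    # Reuses the thread-local driver helper (create_driver) from the answer above,
    # so each pool thread creates one driver and reuses it across listing pages.
    driver = create_driver()
    driver.get(url)
    sel = Selector(text=driver.page_source)
    return {
        'name': sel.xpath("//div[@class='ibox-content animated fadeIn fetchable-content js-attributes-wrapper']/h4[4]/span/a/text()").get(),
        'price': sel.xpath("//span[@class='js-auction-current-price']/text()").get(),
    }

class MarketPagSpider(scrapy.Spider):
    name = 'marketPagination'

    # start_requests unchanged from the question ...

    def parse(self, response):
        rows = response.xpath("//tbody/tr[@role='row']")
        urls = [response.urljoin(row.xpath(".//td[4]/a/@href").get())
                for row in rows]
        # Pool size of 4 is an arbitrary demo value; note that pool.map blocks
        # Scrapy's reactor thread until the whole batch of listing pages is done.
        with ThreadPool(4) as pool:
            for item in pool.map(scrape_listing, urls):
                yield item

The design trade-off here is that pool.map blocks until all listing pages are fetched, so Scrapy's own asynchrony is traded for Selenium-level parallelism; the win comes from reusing one driver per thread instead of opening pages one at a time.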
