Chrome crashes after a few hours when using Selenium with multiprocessing in Python



This is the error traceback after a few hours of scraping:

The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.

This is my Selenium Python setup:

# scrape.py
from selenium import webdriver
from selenium.common.exceptions import *
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options

def run_scrape(link):
    chrome_options = Options()
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument("--headless")
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument("--lang=en")
    chrome_options.add_argument("--start-maximized")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36")
    chrome_options.binary_location = "/usr/bin/google-chrome"
    browser = webdriver.Chrome(executable_path=r'/usr/local/bin/chromedriver', options=chrome_options)
    browser.get(link)  # the link passed in
    try:
        pass  # scrape process
    except Exception:
        pass  # other stuff
    browser.quit()

# multiprocess.py
import time
from multiprocessing import Pool
from scrape import *

if __name__ == '__main__':
    start_time = time.time()
    links = []  # list of links to be scraped
    pool = Pool(20)
    results = pool.map(run_scrape, links)
    pool.close()
    print("Total Time Processed: " + "--- %s seconds ---" % (time.time() - start_time))

Chrome, ChromeDriver, and Selenium versions

ChromeDriver 79.0.3945.36 (3582db32b33893869b8c1339e8f4d9ed1816f143-refs/branch-heads/3945@{#614})
Google Chrome 79.0.3945.79
Selenium Version: 4.0.0a3

I want to know why Chrome is shutting down while the other processes keep working.

I took your code, modified it slightly to fit my test environment, and here are the execution results:

  • Code blocks:

    • multiprocess.py

      import time
      from multiprocessing import Pool
      from multiprocessingPool.scrape import run_scrape

      if __name__ == '__main__':
          start_time = time.time()
          links = ["https://selenium.dev/downloads/", "https://selenium.dev/documentation/en/"]
          pool = Pool(2)
          results = pool.map(run_scrape, links)
          pool.close()
          print("Total Time Processed: " + "--- %s seconds ---" % (time.time() - start_time))
      
    • scrape.py

      from selenium import webdriver
      from selenium.common.exceptions import NoSuchElementException, TimeoutException
      from selenium.webdriver.common.by import By
      from selenium.webdriver.chrome.options import Options

      def run_scrape(link):
          chrome_options = Options()
          chrome_options.add_argument('--no-sandbox')
          chrome_options.add_argument("--headless")
          chrome_options.add_argument('--disable-dev-shm-usage')
          chrome_options.add_argument("--lang=en")
          chrome_options.add_argument("--start-maximized")
          chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
          chrome_options.add_experimental_option('useAutomationExtension', False)
          chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36")
          chrome_options.binary_location = r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
          browser = webdriver.Chrome(executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe', options=chrome_options)
          browser.get(link)
          try:
              print(browser.title)
          except (NoSuchElementException, TimeoutException):
              print("Error")
          browser.quit()
      
  • Console output:

    Downloads
    The Selenium Browser Automation Project :: Documentation for Selenium
    Total Time Processed: --- 10.248600006103516 seconds ---
    

Conclusion

So it is clear that your program is logically sound.


This use case

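Since you mention that this error surfaces after a few hours of scraping, I suspect the cause is that WebDriver is not thread-safe. That said, if you can serialize access to the underlying driver instance, you can share a reference to it across more than one thread. This is not advisable. You can, however, always instantiate one WebDriver instance per thread.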

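Strictly speaking, the thread-safety issue is not in your code but in the browser bindings themselves. They all assume that only one command is issued at a time (e.g., like a real user). On the other hand, you can always instantiate one WebDriver instance per thread, which launches multiple browser tabs/windows. Up to this point, your program looks fine.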
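Because each WebDriver instance launches its own browser, each instance also needs to be shut down again, even when a page load or a scrape step raises. The following is a minimal sketch of one way to guarantee that with try/finally. It is an illustration under stated assumptions, not a confirmed fix for the crash in the question. The names managed_driver, StubDriver, and safe_scrape are hypothetical, and StubDriver only stands in for webdriver.Chrome so that the example runs without a browser.

```python
from contextlib import contextmanager


class StubDriver:
    """Hypothetical stand-in for webdriver.Chrome (assumption for this sketch)."""

    instances = []

    def __init__(self):
        self.quit_called = False
        self.title = None
        StubDriver.instances.append(self)

    def get(self, url):
        # Simulate a navigation failure for URLs containing "bad".
        if "bad" in url:
            raise RuntimeError("navigation failed")
        self.title = "ok"

    def quit(self):
        self.quit_called = True


@contextmanager
def managed_driver(factory):
    """Create a driver and always quit it, whether or not the body raises."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()


def run_scrape(link, factory=StubDriver):
    # Navigation happens inside the managed block, so a failure in get()
    # still reaches the finally clause and the browser is closed.
    with managed_driver(factory) as browser:
        browser.get(link)
        return browser.title


def safe_scrape(link):
    # Report a failed link as None instead of propagating the error.
    try:
        return run_scrape(link)
    except RuntimeError:
        return None


outcomes = [safe_scrape(u) for u in ("https://example.com/good", "https://example.com/bad")]
```

With real Selenium, the factory would be a function that builds webdriver.Chrome with the options shown in scrape.py.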

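Different threads can run against the same WebDriver, but the results will not be what you expect. When you use multithreading to run different tests on different tabs/windows, some thread-safety code is required; otherwise the actions you perform, such as click() or send_keys(), go to whichever open tab/window currently has focus, regardless of which thread you expected to be running. In effect, all the tests run simultaneously on the one tab/window that has focus rather than on the intended tab/window.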
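For illustration only, this is roughly what serializing access to a shared driver could look like, using a lock so that each multi-step action completes before another thread starts. It is a sketch of the discouraged approach described above, not a recommendation. SerializedDriver, FakeDriver, and visit are hypothetical names, and FakeDriver is an assumption that replaces a real WebDriver so the example runs on its own.

```python
import threading


class SerializedDriver:
    """Wrap a driver so that only one thread at a time can use it."""

    def __init__(self, driver):
        self._driver = driver
        self._lock = threading.Lock()

    def run(self, action):
        # The whole callable runs under the lock, so a sequence such as
        # "navigate, then read" cannot be interleaved with another thread.
        with self._lock:
            return action(self._driver)


class FakeDriver:
    """Hypothetical stand-in for a real WebDriver (assumption for this sketch)."""

    def __init__(self):
        self.current_url = None

    def get(self, url):
        self.current_url = url

    def read_url(self):
        return self.current_url


def visit(shared, url):
    def action(driver):
        driver.get(url)
        return driver.read_url()
    return shared.run(action)


shared = SerializedDriver(FakeDriver())
urls = ["https://example.com/%d" % i for i in range(8)]
results = {}


def worker(url):
    results[url] = visit(shared, url)


threads = [threading.Thread(target=worker, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Since the lock lets only one thread drive the browser at a time, this gives no speedup over sequential scraping, which is one reason a separate driver per thread is usually preferred.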

I now use the threading module to instantiate one WebDriver per thread:

import threading
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

threadLocal = threading.local()

def get_driver():
    # Reuse this thread's browser if one was already created
    browser = getattr(threadLocal, 'browser', None)
    if browser is None:
        chrome_options = Options()
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument("--headless")
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument("--lang=en")
        chrome_options.add_argument("--start-maximized")
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
        chrome_options.add_experimental_option('useAutomationExtension', False)
        chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36")
        chrome_options.binary_location = "/usr/bin/google-chrome"
        browser = webdriver.Chrome(executable_path=r'/usr/local/bin/chromedriver', options=chrome_options)
        setattr(threadLocal, 'browser', browser)
    return browser

It did help me scrape faster than running one driver at a time.
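The answer does not show how the per-thread drivers are used or closed, so here is one possible arrangement, offered as a sketch under assumptions: a ThreadPool from multiprocessing.pool maps links onto worker threads, each thread lazily creates its own driver, and every driver created is recorded so it can be quit at the end. StubBrowser, created, scrape, and run_all are hypothetical names; StubBrowser replaces webdriver.Chrome only so that the example runs without a browser.

```python
import threading
from multiprocessing.pool import ThreadPool

thread_local = threading.local()
created = []                      # every driver created, so each can be quit later
created_lock = threading.Lock()


class StubBrowser:
    """Hypothetical stand-in for webdriver.Chrome (assumption for this sketch)."""

    def __init__(self):
        self.closed = False
        self.title = ""

    def get(self, url):
        self.title = "title of " + url

    def quit(self):
        self.closed = True


def get_driver(factory=StubBrowser):
    # One driver per thread: create it on first use, then reuse it.
    browser = getattr(thread_local, "browser", None)
    if browser is None:
        browser = factory()
        thread_local.browser = browser
        with created_lock:
            created.append(browser)
    return browser


def scrape(link):
    browser = get_driver()
    browser.get(link)
    return browser.title


def run_all(links, workers=4):
    pool = ThreadPool(workers)
    try:
        # map() returns results in the same order as the input links.
        return pool.map(scrape, links)
    finally:
        pool.close()
        pool.join()
        # Quit every driver once all worker threads have finished.
        for browser in created:
            browser.quit()


links = ["https://example.com/a", "https://example.com/b",
         "https://example.com/c", "https://example.com/d"]
titles = run_all(links, workers=2)
```

Tracking the drivers in a shared list is a design choice made here because thread-local objects cannot be reached from the main thread once the workers are done.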
