Selenium DOM finder returns the page body instead of a WebElement

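I am developing a crawler for news sites such as Techcrunch and Bloomberg. All of these sites follow a similar pattern: article cards are lazy-loaded, and more of them appear when you click a "load more" type of button.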

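I designed it to run the loading process and the digesting process in parallel using multiprocessing. For context, the run method below lives in a Crawler class that abstracts the elements of different sites, so that no separate scraper has to be written for each site. This is the entry method: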

def run(self):
    """ Runs a crawler. """
    binary: FirefoxBinary = FirefoxBinary(firefox_path="/usr/bin/firefox")
    self.driver: Firefox = Firefox(firefox_binary=binary)
    self.driver.get(self.url)
    self.load_pipe, self.digest_pipe = Pipe()
    load_proc: Process = Process(target=self._load_content)
    load_proc.start()
    digest_proc: Process = Process(target=self._digest_content)
    digest_proc.start()

The problem arises in the loading process, implemented in the _load_content method, specifically on its first line, with the call to find_element_by_class_name.

def _load_content(self):
    """ Loads more content. """
    loader: WebElement = self.driver.find_element_by_class_name(self.loader_name)
    ...

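When I test it synchronously, without parallelism, the function returns a WebElement representing the target button. When run in parallel, however, it returns a str containing the entire page body, and the code then throws AttributeError: 'str' object has no attribute 'click'.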

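I made sure the driver is still intact inside _load_content, but the method still returns a str instead of a WebElement. Strangely, if no element with the given class identifier is found, it raises NoSuchElementException. So why does it return the HTML body as a str? What am I missing? Is multiprocessing somehow messing with the driver API?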

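The WebDriver API is not thread-safe, because of limitations in the browser itself. The browser expects one command at a time, so the two steps have to run synchronously rather than in parallel. Even if you have the resources to do so, running multiple browser instances will not solve the problem, because no state is shared between them.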

One potential solution is to implement a callback structure between the loading and digesting steps, something like this (pseudocode):

while article cards are available
    digest article cards
    if no article cards are available
        load more article cards
        start digesting article cards again
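The pseudocode above could be sketched in Python roughly as follows. This is only a minimal, driver-agnostic illustration of the single-process alternation, not code from the crawler in the question: the function name crawl_sequentially, the three callbacks (find_new_cards, digest, load_more) and the max_loads safety cap are all hypothetical names introduced here.

```python
from typing import Callable, List, TypeVar

Card = TypeVar("Card")


def crawl_sequentially(
    find_new_cards: Callable[[], List[Card]],
    digest: Callable[[Card], None],
    load_more: Callable[[], bool],
    max_loads: int = 50,
) -> int:
    """Alternate between digesting and loading in ONE process.

    All three callbacks are hypothetical hooks:
      find_new_cards -- returns the cards that have not been digested yet
      digest         -- processes a single card
      load_more      -- triggers the "load more" action and returns False
                        when nothing more can be loaded
    max_loads is an assumed safety cap so the loop always terminates.
    Returns the number of cards digested.
    """
    digested = 0
    loads = 0
    while True:
        cards = find_new_cards()
        if cards:
            # Digest everything currently available, then look again.
            for card in cards:
                digest(card)
                digested += 1
            continue
        # No cards left: load more, or stop if that is no longer possible.
        if loads >= max_loads or not load_more():
            break
        loads += 1
    return digested
```

With Selenium, find_new_cards might wrap driver.find_elements(By.CLASS_NAME, ...) and filter out elements that were already seen, while load_more might click the loader button and return False when NoSuchElementException is raised. Because only this one loop ever talks to the driver, the browser receives a single command at a time.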

The failure in find_element_by_class_name is probably caused by corrupted state in the driver instance, or by the browser binding not behaving the way the API expects.
