Selenium web scraper does not scrape the desired tags



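These are the two tags I am trying to scrape: https://i.stack.imgur.com/a1sVN.png. In case you are wondering, this is the link to the page (the tags I am trying to scrape are not behind the paywall): https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635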

Below is the Python code I am using. Does anyone know why the tags are not being stored in paragraphs?

from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635'
driver = webdriver.Chrome()
driver.get(url)
paragraphs = driver.find_elements(By.CLASS_NAME, 'css-xbvutc-Paragraph e3t0jlg0')
print(len(paragraphs)) # => prints 0

There are two problems affecting you here.

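  1. You need to wait for the page to finish loading after you call driver.get(). One way to do that is a fixed pause: import time, then call time.sleep(10).
  2. The class names on the elements you are trying to scrape change every time the page loads. The data-type="paragraph" attribute, however, stays the same, so you can search on that instead: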

paragraphs = driver.find_elements(By.XPATH, '//*[@data-type="paragraph"]') # search by XPath to find the elements with that data attribute
print(len(paragraphs))

Once the page has loaded, this prints: 2
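To try the attribute-based lookup without launching a browser, here is a minimal sketch that runs the same kind of XPath against a made-up HTML fragment using only the standard library. The markup, the class names in it, and the find_paragraphs helper are all hypothetical; they only illustrate that a data-type match does not care what the generated class names happen to be.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the class names are invented and stand in
# for generated values that may differ from one page load to the next.
SAMPLE_PAGE = """
<article>
  <p class="css-abc123-Paragraph x1" data-type="paragraph">First paragraph.</p>
  <div class="css-zzz-Ad">Advertisement</div>
  <p class="css-def456-Paragraph y2" data-type="paragraph">Second paragraph.</p>
</article>
"""

def find_paragraphs(markup):
    """Return the text of every element whose data-type attribute is 'paragraph'."""
    root = ET.fromstring(markup)
    # Same idea as the Selenium XPath; ElementTree needs the './/' prefix
    # to search all descendants of the root element.
    matches = root.findall(".//*[@data-type='paragraph']")
    return [element.text for element in matches]

paragraph_texts = find_paragraphs(SAMPLE_PAGE)
print(len(paragraph_texts))
```

Both p elements are found even though their class attributes differ, while the div without the attribute is skipped.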

Just to add to @Andrew Ryan's answer: you can use an explicit wait, which gives a shorter and more dynamic wait time than a fixed sleep.

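from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds, returning as soon as at least one matching element is present
paragraphs = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@data-type="paragraph"]'))
)
print(len(paragraphs))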
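For intuition about why an explicit wait tends to finish sooner than a fixed sleep, here is a rough sketch of the polling idea in plain Python. It is not Selenium's implementation: wait_until and fake_find_paragraphs are hypothetical names, and the fake lookup simply pretends that the paragraphs show up on the third check.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Stop as soon as the condition holds instead of sleeping the full timeout.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)

# Hypothetical stand-in for a page whose paragraphs appear on the third check.
attempts = {"count": 0}

def fake_find_paragraphs():
    attempts["count"] += 1
    return ["p1", "p2"] if attempts["count"] >= 3 else []

found = wait_until(fake_find_paragraphs, timeout=2.0, poll_interval=0.01)
print(len(found), attempts["count"])
```

The call returns after three checks rather than waiting out the whole timeout, which is the advantage over time.sleep(10) when the page loads quickly.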
