Why can't Selenium and the Firefox WebDriver scrape AJAX-loaded tags on a website



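I want to get the text of some HTML tags from bonbast. Some of these elements are loaded by AJAX (for example, the tag with the id "ounce_top"). I tried Selenium with geckodriver, but I still cannot scrape these tags, and when the automated Firefox (geckodriver) window opens, these elements are not shown on the page at all! I don't know why this happens. How can I crawl this website?

Code trial: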

from selenium import webdriver
from bs4 import BeautifulSoup
url_news = 'https://bonbast.com/'
driver = webdriver.Firefox()
driver.get(url_news)
html = driver.page_source
soup = BeautifulSoup(html)
a = driver.find_element_by_id(id_="ounce_top")

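The desired element is a dynamic element, so ideally, to extract the desired text, i.e. 1,817.43, you need to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following locator strategies: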

  • Using CSS_SELECTOR:

    driver.get("https://bonbast.com/")
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.btn.btn-primary.btn-sm.acceptcookies"))).click()
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span#ounce_top"))).text)
    
  • Using XPATH:

    driver.get("https://bonbast.com/")
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.btn.btn-primary.btn-sm.acceptcookies"))).click()
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//span[@id='ounce_top']"))).text)
    
  • Console output:

    1,817.43
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in "How to retrieve the text of a WebElement using Selenium - Python".
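
The text printed above is a string that contains a thousands separator. If a numeric value is needed, it could be converted along these lines. This is a minimal sketch, not part of the original answer: `parse_price` is a hypothetical helper name, and it assumes the site always uses "," as the thousands separator and "." as the decimal mark.

```python
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Convert a string such as "1,817.43" to a Decimal.

    Assumption: "," is a thousands separator and "." is the decimal mark.
    """
    # Drop surrounding whitespace and the thousands separators
    cleaned = text.strip().replace(",", "")
    return Decimal(cleaned)


print(parse_price("1,817.43"))  # 1817.43
```

`Decimal` is used here instead of `float` so that the price keeps exactly the digits shown on the page.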

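To do this with Selenium, you need to add a wait/delay. It is best to use an explicit wait with expected conditions.
I guess you are trying to get the text value inside that element.
This should work: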

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url_news = 'https://bonbast.com/'
driver = webdriver.Firefox()
wait = WebDriverWait(driver, 20)  # wait up to 20 seconds
driver.get(url_news)
# Block until the AJAX-loaded element is visible, then read its text
your_gold_value = wait.until(EC.visibility_of_element_located((By.ID, "ounce_top"))).text
# Take the page source only after the wait, so it reflects the loaded content
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
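
To illustrate why the explicit wait helps, here is a rough pure-Python sketch of the idea behind it: poll a condition until it returns something truthy or a timeout expires. This is not Selenium's actual implementation. `wait_until` and `FakePage` are hypothetical names made up for this illustration, and `FakePage` only simulates an element whose text is filled in after a few polls.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical stand-in for a page whose element is filled in by AJAX."""

    def __init__(self, ready_after_polls):
        self._remaining = ready_after_polls

    def ounce_text(self):
        # Return an empty string until the simulated AJAX call has "finished"
        if self._remaining > 0:
            self._remaining -= 1
            return ""
        return "1,817.43"


page = FakePage(ready_after_polls=3)
print(wait_until(page.ounce_text, timeout=2, poll_interval=0.01))
```

Reading the element immediately after `driver.get()` corresponds to calling `page.ounce_text()` once, before the value has arrived, which is why the original attempt found nothing.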
