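For my data project I am trying to scrape a website with Selenium. It loads new articles by incrementing the page number: https://geschenkly.de/page/1/, then 2, 3, 4, and so on. But starting with the very first page, the Chrome webdriver displays the site, yet whenever I try to find an element the result is either empty or does not exist: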
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import json
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument("--headless")
chrome_options.add_argument(f'user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
chrome_options.add_argument("window-size=1920,1080")
s=Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s, options=chrome_options)
#page = 1
driver.get('https://geschenkly.de/page/1/')
wait = wait(driver, 60)
elements = driver.find_elements(By.CLASS_NAME, "woocommerce-LoopProduct-link woocommerce-loop-product__link")
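The class names belong to the links that point to the individual article pages. I can find them when inspecting the page, but in Selenium the elements list is an empty array.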
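woocommerce-LoopProduct-link woocommerce-loop-product__link is actually multiple class names. By.CLASS_NAME accepts only a single class name, so it cannot find elements with such a value. To locate elements by multiple class names, use CSS_SELECTOR or XPATH.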
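As a rough illustration of that conversion, the sketch below builds a compound CSS selector and an equivalent XPath from a space-separated class attribute. The helper names css_for_classes and xpath_for_classes are hypothetical and not part of Selenium; the sketch also assumes the class names need no CSS escaping, which holds for the two names in the question.

```python
def css_for_classes(class_attr: str, tag: str = "") -> str:
    """Turn a space-separated class attribute into a compound CSS selector."""
    return tag + "".join("." + name for name in class_attr.split())


def xpath_for_classes(class_attr: str, tag: str = "*") -> str:
    """Build an XPath that requires every class name as a whole token."""
    tests = [
        f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"
        for name in class_attr.split()
    ]
    return f"//{tag}[" + " and ".join(tests) + "]"


CLASSES = "woocommerce-LoopProduct-link woocommerce-loop-product__link"
print(css_for_classes(CLASSES, tag="a"))
print(xpath_for_classes(CLASSES, tag="a"))

# With a live driver the strings would be passed to Selenium like this:
#   driver.find_elements(By.CSS_SELECTOR, css_for_classes(CLASSES, tag="a"))
#   driver.find_elements(By.XPATH, xpath_for_classes(CLASSES, tag="a"))
```

The CSS form is usually the shorter and more readable of the two; the XPath form is mainly useful when the rest of the locator already has to be XPath.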
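You also need to use expected conditions to actually wait for the elements, rather than just creating the wait object and never using it.
In addition, your locator can be improved.
This would be better: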
driver.get('https://geschenkly.de/page/1/')
wait = wait(driver, 60)
# presence_of_all_elements_located returns a list of every matching element
elements = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".woocommerce-LoopProduct-link.woocommerce-loop-product__link")))
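With the locator above, however, you will also get irrelevant elements.
The following locator returns half as many elements as before, which looks more correct: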
driver.get('https://geschenkly.de/page/1/')
wait = wait(driver, 60)
# Scoping to div.thumb-wrapper.zoom keeps one link per product
elements = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.thumb-wrapper.zoom a.woocommerce-LoopProduct-link.woocommerce-loop-product__link")))
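Since the question is about walking through /page/1/, /page/2/ and so on, here is a minimal sketch of how the narrower locator might be applied across several pages. The names page_url, unique_hrefs and scrape_pages, the max_pages and timeout parameters, and the "stop on timeout" rule are all assumptions made for illustration; the function expects a driver created as in the question.

```python
BASE_URL = "https://geschenkly.de/page/{page}/"
PRODUCT_LINKS = (
    "div.thumb-wrapper.zoom "
    "a.woocommerce-LoopProduct-link.woocommerce-loop-product__link"
)


def page_url(page: int) -> str:
    """Return the listing URL for a 1-based page number."""
    return BASE_URL.format(page=page)


def unique_hrefs(elements) -> list:
    """Return each element's href once, keeping first-seen order."""
    seen = {}
    for element in elements:
        href = element.get_attribute("href")
        if href:
            seen.setdefault(href, None)
    return list(seen)


def scrape_pages(driver, max_pages: int = 3, timeout: int = 20) -> list:
    """Collect product links from the first max_pages listing pages."""
    # Imported here so the helpers above work without Selenium installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    links = []
    for page in range(1, max_pages + 1):
        driver.get(page_url(page))
        try:
            elements = WebDriverWait(driver, timeout).until(
                EC.presence_of_all_elements_located(
                    (By.CSS_SELECTOR, PRODUCT_LINKS)
                )
            )
        except TimeoutException:
            # Assumption: no product links means we ran past the last page.
            break
        links.extend(unique_hrefs(elements))
    return links

# Usage with the driver from the question:
#   links = scrape_pages(driver, max_pages=5)
```

Creating a fresh WebDriverWait per page under its own name also avoids rebinding the imported wait alias, which would break on a second call.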