Finding table elements by XPath with Selenium returns only the elements present in the HTML source, even though the XPath highlights all of them in inspect

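I am trying to scrape data from the table on this site: https://prosettings.net/cs-go-pro-settings-gear-list/. Mouse sensitivity is the first value I have been trying to get. When I search with the following XPath, all the elements I need are highlighted in the inspect/developer tools: //table[@id="table_1"]/tbody/tr/td[8].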

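Scraping the table elements with WebDriverWait and find_elements_by_xpath, using the XPath above, returns only 10 of the roughly 475 equivalent elements in the table. This happens even when I use WebDriverWait to give all the elements a chance to load, and when I use scrollIntoView in case the problem was that the data would not load without scrolling. The only thing these 10 elements have in common is that they are the only ones of the 475 that appear in the HTML source, which I did not think would be a problem since I am using Selenium and searching by XPath. Here is my code: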

import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

PATH = r"your own chromedriver path here" 
driver = webdriver.Chrome(executable_path = PATH) 
driver.get("https://prosettings.net/cs-go-pro-settings-gear-list/")
rows = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//table[@id="table_1"]/tbody/tr/td[8]')))
for row in rows:
    print(row.get_attribute('innerHTML'))
driver.close()

For me, this returns only the mouse sensitivity values that you can find in the HTML source:

2.00
2.40
1.90
2.00
1.87
1.60
1.00
2.20
2.00
3.20

I can't seem to figure this one out!

We need to scroll first to make the table visible, and then scroll to each row. Try it like this:

from selenium import webdriver
import time

driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://prosettings.net/cs-go-pro-settings-gear-list/")
time.sleep(10) # Takes time load page.
table = driver.find_element_by_xpath("//table[@id='table_1']") #Find the table scroll into view.
driver.execute_script("arguments[0].scrollIntoView(true);",table)
i = 0
try:
    while True:
        rows = driver.find_elements_by_xpath("//table[@id='table_1']/tbody/tr/td[7]")
        driver.execute_script("arguments[0].scrollIntoView(true);", rows[i])  # Find the rows and scroll into view.
        print(rows[i].text)
        i += 1
except Exception as e:
    print(e)
print("Total number of rows = {}".format(i))
driver.quit()

Output:

1.45
2.20
3.09
...
list index out of range
Total number of rows = 475

I'm actually not sure this is the best approach, but the following has worked for me in the past:

driver.get("https://prosettings.net/cs-go-pro-settings-gear-list/")
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//table[@id="table_1"]/tbody/tr/td[8]')))
html = driver.page_source

Then scrape the data from html.

You can find more about my approach here: https://simplepush.io/blog/python-web-scraping-javascript-with-selenium
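
As a rough illustration of the "scrape the html" step, here is a sketch that pulls one column out of a table in an HTML string such as driver.page_source, using only Python's standard-library html.parser. The names ColumnExtractor and extract_column are hypothetical, and the table id and 1-based column index simply mirror the XPath used above. It assumes a simple table with no nested tables and no colspan cells, and unlike the XPath it does not require the rows to sit inside a tbody. A third-party parser such as BeautifulSoup or lxml would probably be more convenient in practice.

```python
from html.parser import HTMLParser


class ColumnExtractor(HTMLParser):
    """Collect the text of the n-th <td> of every row of one table."""

    def __init__(self, table_id, column):
        super().__init__()
        self.table_id = table_id
        self.column = column    # 1-based, like td[8] in the XPath
        self.in_table = False   # True while inside the table we want
        self.td_index = 0       # position of the current <td> in its row
        self.capturing = False  # True while inside the wanted <td>
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.td_index = 0
        elif self.in_table and tag == "td":
            self.td_index += 1
            if self.td_index == self.column:
                self.capturing = True
                self.values.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.capturing = False
        elif tag == "table":
            self.in_table = False

    def handle_data(self, data):
        # Text inside nested tags (for example a <span>) is appended as well.
        if self.capturing:
            self.values[-1] += data


def extract_column(html, table_id, column):
    """Return the stripped text of the given column for each table row."""
    parser = ColumnExtractor(table_id, column)
    parser.feed(html)
    parser.close()
    return [value.strip() for value in parser.values]
```

With the html variable from the snippet above, extract_column(html, "table_1", 8) would return the eighth cell of each row that is present in that HTML.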
