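I'm trying to scrape data from the table on this site: https://prosettings.net/cs-go-pro-settings-gear-list/. Mouse sensitivity is the first value I've been trying to get. When I search with the following xpath, all the elements I need are highlighted in the inspect/developer tools: //table[@id="table_1"]/tbody/tr/td[8].
Grabbing the table elements with WebDriverWait and find_elements_by_xpath and the xpath above returns only 10 of the roughly 475 identical elements in the table. This happens even when I use WebDriverWait to give all the elements a chance to load, and also when I use scrollIntoView in case the problem is that the data does not load without scrolling. The only thing these 10 elements have in common is that they are the only ones of the 475 that appear in the HTML source, which I didn't think would be a problem since I'm using Selenium and searching by xpath. Here is my code: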
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
PATH = r"your own chromedriver path here"
driver = webdriver.Chrome(executable_path = PATH)
driver.get("https://prosettings.net/cs-go-pro-settings-gear-list/")
rows = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//table[@id="table_1"]/tbody/tr/td[8]')))
for row in rows:
    print(row.get_attribute('innerHTML'))
driver.close()
For me, this returns only the mouse sensitivity values that you can find in the HTML source:
2.00
2.40
1.90
2.00
1.87
1.60
1.00
2.20
2.00
3.20
I can't seem to figure this one out!
We need to scroll first to bring the table into view, and then scroll to each row. Try it like this:
from selenium import webdriver
import time
driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://prosettings.net/cs-go-pro-settings-gear-list/")
time.sleep(10)  # Give the page time to load.
table = driver.find_element_by_xpath("//table[@id='table_1']")  # Find the table and scroll it into view.
driver.execute_script("arguments[0].scrollIntoView(true);", table)
i = 0
try:
    while True:
        # Re-query the cells on every pass, because more rows appear as we scroll.
        rows = driver.find_elements_by_xpath("//table[@id='table_1']/tbody/tr/td[7]")
        driver.execute_script("arguments[0].scrollIntoView(true);", rows[i])  # Scroll the current cell into view.
        print(rows[i].text)
        i += 1
except Exception as e:
    # The loop ends with an IndexError once i runs past the last cell.
    print(e)
print("Total number of rows = {}".format(i))
driver.quit()
Output:
1.45
2.20
3.09
...
list index out of range
Total number of rows = 475
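The loop above stops by running into an IndexError. As a rough sketch of the same scroll-and-read idea with an explicit stop condition, something like the following might work. The names collect_column and max_rounds are made up for illustration. The locator is passed as the plain string "xpath", which is the value of By.XPATH in Selenium 4. I have not run this against the site itself.

```python
def collect_column(driver, xpath, max_rounds=1000):
    """Scroll to each matching cell in turn and collect its text.

    Hypothetical helper: `driver` is assumed to be a Selenium WebDriver
    (or anything with find_elements and execute_script).
    """
    values = []
    for _ in range(max_rounds):
        # Re-query on every pass, since scrolling may add more cells to the DOM.
        cells = driver.find_elements("xpath", xpath)  # "xpath" == By.XPATH
        if len(values) >= len(cells):
            break  # every cell currently in the DOM has been visited
        cell = cells[len(values)]
        driver.execute_script("arguments[0].scrollIntoView(true);", cell)
        values.append(cell.text)
    return values

# Usage (assumes an already opened driver):
# values = collect_column(driver, "//table[@id='table_1']/tbody/tr/td[7]")
# print("Total number of rows = {}".format(len(values)))
```

The max_rounds cap is only there so the loop cannot run forever if the page keeps adding rows.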
I'm actually not sure whether this is the best way, but the following approach has worked for me in the past:
driver.get("https://prosettings.net/cs-go-pro-settings-gear-list/")
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//table[@id="table_1"]/tbody/tr/td[8]')))
html = driver.page_source
Then do the scraping on html.
You can find more about my approach here: https://simplepush.io/blog/python-web-scraping-javascript-with-selenium
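To make "scraping on html" a bit more concrete, here is a minimal sketch that pulls one column out of an HTML string using only the standard library. ColumnExtractor, extract_column and the sample markup are made up for illustration, and the sketch assumes a simple table with no nested tables. Whether driver.page_source actually contains all the rows depends on how the page loads them.

```python
from html.parser import HTMLParser


class ColumnExtractor(HTMLParser):
    """Collect the text of the n-th <td> (1-based) of every <tr>."""

    def __init__(self, column):
        super().__init__()
        self.column = column
        self.values = []
        self._td_index = 0
        self._in_target = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._td_index = 0  # new row, restart the cell counter
        elif tag == "td":
            self._td_index += 1
            if self._td_index == self.column:
                self._in_target = True
                self._buffer = []

    def handle_data(self, data):
        if self._in_target:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_target:
            self.values.append("".join(self._buffer).strip())
            self._in_target = False


def extract_column(html, column):
    """Return the text of the given column for every row in the HTML."""
    parser = ColumnExtractor(column)
    parser.feed(html)
    parser.close()
    return parser.values


# Made-up sample standing in for driver.page_source.
sample = """<table id="table_1"><tbody>
<tr><td>player a</td><td>2.00</td></tr>
<tr><td>player b</td><td>2.40</td></tr>
</tbody></table>"""
print(extract_column(sample, 2))  # prints ['2.00', '2.40']
```

With the real page you would pass html and 8 instead of the sample and 2, assuming the sensitivity is still the eighth cell of each row.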