Web scraping a JavaScript __doPostBack link whose href is inside a TD



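I want to scrape a website, namely https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=, using Selenium, but I can only scrape the first page and not the others.

Here is the Selenium code I am using: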

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(executable_path='C:/Users/ptiwar34/Documents/chromedriver.exe', chrome_options=chromeOptions, desired_capabilities=chromeOptions.to_capabilities())
driver.get('https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=')
WebDriverWait(driver, 20).until(EC.staleness_of(driver.find_element_by_xpath("//td/a[text()='2']")))
driver.find_element_by_xpath("//td/a[text()='2']").click()
numLinks = len(WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//td/a[text()='2']"))))
print(numLinks)
for i in range(numLinks):
    print("Perform your scraping here on page {}".format(str(i+1)))
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//td/a[text()='2']/span//following::span[1]"))).click()
driver.quit()

Here is the HTML content:

<td><span>1</span></td>
<td><a href="javascript:__doPostBack(&#39;dnn$ctr1535$UNSPSCSearch$gvDetailsSearchView&#39;,&#39;Page$2&#39;)" style="color:#333333;">2</a>
</td>

This raises an error:

raise TimeoutException(message, screen, stacktrace)
TimeoutException

To scrape the website https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27= using Selenium, you can use the following locator strategy:

  • Code block:

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC

    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("start-maximized")
    driver = webdriver.Chrome(options=chrome_options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get("https://www.unspsc.org/search-code/default.aspx?CSS=51%&Type=desc&SS%27=%27")
    while True:
        try:
            # The current page number is a <span>; the first <a> after it is the next page.
            WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//table[contains(@id, 'UNSPSCSearch_gvDetailsSearchView')]//tr[last()]//table//span//following::a[1]"))).click()
            print("Clicked for next page")
        except TimeoutException:
            # No clickable link follows the current page number, so this is the last page.
            print("No more pages")
            break
    driver.quit()
    
  • Console output:

    Clicked for next page
    Clicked for next page
    Clicked for next page
    .
    .
    .
    
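  • Explanation: If you inspect the HTML DOM, the page numbers sit inside a <table> whose dynamic id attribute contains the text UNSPSCSearch_gvDetailsSearchView. More precisely, the page numbers are in the last <tr>, which holds a child <table>. Within that child table, the current page number is inside a <span>, and that <span> is the key. So to click() the next page number, you only need to identify the following <a> tag with index [1]. Finally, because the element's href is a javascript:__doPostBack() call, you have to induce WebDriverWait for the desired element_to_be_clickable().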

You can find a detailed discussion in "How do I wait for a JavaScript __doPostBack call through Selenium and WebDriver".
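As a side note on what such a link carries, here is a minimal sketch that pulls the two arguments out of a __doPostBack href like the one in the question. The helper name `parse_postback` is hypothetical and not part of Selenium; it only assumes the href has the two single-quoted arguments shown in the question's HTML.

```python
import html
import re

# Matches __doPostBack('target','argument'), tolerating stray whitespace.
_POSTBACK_RE = re.compile(
    r"__doPostBack\s*\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)"
)


def parse_postback(href):
    """Return (event_target, event_argument) from a __doPostBack href, or None."""
    # Raw page source encodes the quotes as &#39;; the live DOM returns them decoded.
    match = _POSTBACK_RE.search(html.unescape(href))
    return match.groups() if match else None


href = ("javascript:__doPostBack(&#39;dnn$ctr1535$UNSPSCSearch$gvDetailsSearchView&#39;,"
        "&#39;Page$2&#39;)")
target, argument = parse_postback(href)
print(target)    # dnn$ctr1535$UNSPSCSearch$gvDetailsSearchView
print(argument)  # Page$2
```

With a live driver, the same pair could presumably be passed to driver.execute_script("__doPostBack(arguments[0], arguments[1]);", target, argument), assuming the page defines __doPostBack; clicking the link as shown above has the same effect.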

To find/click the page numbers, you can use:

for x in driver.find_elements_by_xpath("//a[contains(@href,'UNSPSCSearch$gvDetailsSearchView')]"):
    if x.text.isdigit():
        print(x.text)
        #x.click()
        #...

Output:


2
3
4
...


Based on your comment, you can use:

max_pages = 10
for page_number in range(2, max_pages+1):
    for x in driver.find_elements_by_xpath("//a[contains(@href,'UNSPSCSearch$gvDetailsSearchView')]"):
        if x.text.isdigit():
            if int(x.text.strip()) == page_number:
                x.click()
                #parse results here
                break
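The matching step in the loop above can be pulled into a small helper so it can be checked without a browser. This is only a sketch: `find_page_link` and `fake_links` are hypothetical names, not part of Selenium, and the helper assumes each element exposes its label through a .text attribute, as the elements returned by find_elements_by_xpath do.

```python
from types import SimpleNamespace


def find_page_link(links, page_number):
    """Return the first link whose visible text equals page_number, else None."""
    for link in links:
        text = link.text.strip()
        if text.isdigit() and int(text) == page_number:
            return link
    return None


# Stand-ins for WebElement objects; only the .text attribute is used here.
fake_links = [SimpleNamespace(text=t) for t in ("2", "3", "...", " 4 ")]
print(find_page_link(fake_links, 4).text.strip())  # 4
print(find_page_link(fake_links, 9))               # None
```

Returning None for a missing page number gives the caller a clear signal to stop paging, instead of silently looping up to max_pages.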
