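I am trying to scrape a website: for each product, the script should open the product in a new tab, scrape it, close the tab, return to the original tab, and then move on to the next product. I think the problem is with the XPath; I am using "//a[contains(@class,'prdLink')]".
Here is the code using that XPath, but for some reason it never opens any product page: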
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(executable_path='C:/Users/ptiwar34/Documents/chromedriver.exe', chrome_options=chromeOptions, desired_capabilities=chromeOptions.to_capabilities())
while True:
    try:
        driver.get("https://www.besse.com/pages/products-specialties/productsbyspecialty/allspecialties")
        # Wait until all matching links are visible, then collect their hrefs
        my_hrefs = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[contains(@class,'prdLink')]")))]
        windows_before = driver.current_window_handle
        for my_href in my_hrefs:
            # Open the product in a new tab and switch to it
            driver.execute_script("window.open('" + my_href + "');")
            WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
            windows_after = driver.window_handles
            new_window = [x for x in windows_after if x != windows_before][0]
            driver.switch_to.window(new_window)
            time.sleep(3)
            print(driver.title)
            # Close the product tab and return to the original tab
            driver.close()
            driver.switch_to.window(windows_before)
    except TimeoutException:
        print("No more pages")
        break
driver.quit()
It does not open even a single product, and the only output is "No more pages".
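The XPath is correct; the problem is that these links are not visible, so the visibility wait times out. To make them visible you would need to expand every section (and scroll down to do so).
In this case it is faster to parse the page source than to drive the page with Selenium.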
from lxml import etree

driver.get("https://www.besse.com/pages/products-specialties/productsbyspecialty/allspecialties")
root = etree.HTML(driver.page_source)
# @href!='' is in the XPath because some hrefs are empty strings
my_hrefs = root.xpath(".//a[contains(@class,'prdLink') and @href!='']/@href")
for my_href in my_hrefs:
    pass  # rest of your code goes here
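
If you would rather not add lxml as a dependency, roughly the same filter (anchors whose class contains 'prdLink' and whose href is non-empty) can be sketched with the standard library's html.parser instead of XPath. This is only an illustration: SAMPLE_HTML is a made-up fragment standing in for driver.page_source, and ProductLinkParser, extract_product_links, and BASE_URL are hypothetical names. Resolving relative hrefs with urljoin is an assumption on my part; skip it if the site's links are already absolute.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Made-up fragment standing in for driver.page_source (not the real page).
SAMPLE_HTML = """
<div>
  <a class="prdLink" href="/products/a">Product A</a>
  <a class="prdLink hidden" href="">Empty link</a>
  <a class="other" href="/about">About</a>
  <a class="item prdLink" href="https://example.com/products/b">Product B</a>
</div>
"""

# Assumption: used only to turn relative hrefs into absolute URLs.
BASE_URL = "https://www.besse.com/"


class ProductLinkParser(HTMLParser):
    """Collect non-empty hrefs from <a> tags whose class contains 'prdLink'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        css_class = attr_map.get("class") or ""
        href = attr_map.get("href") or ""
        # Substring match on class, like contains(@class,'prdLink');
        # a missing or empty href is skipped, like @href!=''.
        if "prdLink" in css_class and href:
            self.hrefs.append(urljoin(BASE_URL, href))


def extract_product_links(html):
    """Return the product URLs found in an HTML string."""
    parser = ProductLinkParser()
    parser.feed(html)
    return parser.hrefs


links = extract_product_links(SAMPLE_HTML)
print(links)
```

With a live session you would call extract_product_links(driver.page_source) and loop over the result exactly as with my_hrefs above.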