Problem scraping multiple pages with Python

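I am trying to scrape a web page and the links inside it. The page is: https://webgate.ec.europa.eu/rasff-window/screen/list. As you can see, it lists 6,000+ notifications, and each notification has its own link. I want to store all of those links in a list. I am using the code below: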

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from webdriver_manager.chrome import ChromeDriverManager

d = webdriver.Chrome(ChromeDriverManager().install())
# load the listing page before looking for links
d.get("https://webgate.ec.europa.eu/rasff-window/screen/list")

# trying this scraping for multiple pages
links = []
i = 1

# collect every href on the first page
elems = d.find_elements_by_xpath("//a[@href]")
for elem in elems:
    link_list = elem.get_attribute("href")
    links.append(link_list)

while True:
    print("This is now page {}".format(i))
    i += 1
    time.sleep(1)
    try:
        time.sleep(0.5)
        # click "Next page" once it is reported as clickable
        WebDriverWait(d, 10).until(EC.element_to_be_clickable((By.XPATH, "//button[@aria-label='Next page']"))).click()
        print("we have clicked it once")
        time.sleep(0.9)

        # collect every href on the newly loaded page
        elems2 = d.find_elements_by_xpath("//a[@href]")
        for elem2 in elems2:
            link_list = elem2.get_attribute("href")
            links.append(link_list)
        print("The button is clickable")
        time.sleep(1)
    except:
        print("The button is now not clickable, we have collected all the links")
        break

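The idea is to use Selenium to first find all the href links on the page, then click the Next page button and do the same again, which is what my while loop does. But when I run this code, it does not get through the whole loop. For example, with roughly 6,400 notifications I would expect it to run up to page 64, but it stops partway through and reports that the Next button is not clickable (the except branch), even though the button actually is clickable. This happens on random pages, and I have tried changing the time.sleep values as well. Am I doing something wrong?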

I checked the exception message with

except Exception as ex:
    print(ex)

and it shows that the problem is not the button but the href.

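It seems that the code sometimes gets its references to the <a> elements before JavaScript has finished updating all the elements on the page. Then, when it tries to read href from one of those <a> elements, the error says that this <a> no longer exists on the page, because in the meantime JavaScript has removed it and put a new <a> in its place.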
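
One way to cope with that failure mode is to look the anchors up again whenever reading href fails. The following is only a minimal sketch under some assumptions: collect_hrefs, stale_error, attempts and pause are hypothetical names, and it assumes the error is Selenium's StaleElementReferenceException, which the message above does not confirm. The exception class is passed in as a parameter so the helper itself has no Selenium import.

```python
import time


def collect_hrefs(driver, stale_error, attempts=5, pause=0.5):
    """Return the href of every <a href> on the current page.

    If an element is replaced while we are reading it (stale_error is
    raised), wait briefly and look all the anchors up again.
    """
    for _ in range(attempts):
        try:
            # "xpath" is the string value of selenium's By.XPATH
            anchors = driver.find_elements("xpath", "//a[@href]")
            return [a.get_attribute("href") for a in anchors]
        except stale_error:
            # the page was still being rebuilt; retry with fresh references
            time.sleep(pause)
    raise RuntimeError("anchors kept going stale after {} attempts".format(attempts))
```

With a real driver this might be called as collect_hrefs(d, StaleElementReferenceException), with the exception imported from selenium.common.exceptions. Retrying hides the symptom but does not prove that the page has finished updating.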

Checking whether the button is clickable is probably useless, because the button is there the whole time.

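You should sleep a little longer before getting the <a> elements. Or you could find a better way to detect whether you got new references or the same ones as before.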
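
As a rough illustration of the second suggestion, the loop could poll until the links read from the page differ from the ones collected on the previous page, instead of sleeping for a fixed time. This is a sketch, not a tested fix for this site: wait_for_new_hrefs, read_hrefs, previous, poll and ignored are hypothetical names, and read_hrefs is assumed to be any callable that returns the current page's href strings as a list.

```python
import time


def wait_for_new_hrefs(read_hrefs, previous, timeout=10.0, poll=0.25, ignored=()):
    """Poll read_hrefs() until it returns a non-empty list that differs
    from `previous`, then return that list.

    Exceptions listed in `ignored` (for example a stale-element error)
    are treated as "page still updating" and polling continues.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            current = read_hrefs()
        except ignored:
            current = None  # page was mid-update; try again
        if current and current != previous:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError("page content did not change within the timeout")
        time.sleep(poll)
```

In the loop from the question, previous would be the list read before clicking Next page, and a TimeoutError would be a more reliable signal that the last page has been reached than the button check. A comparison like this can still be fooled if the page is only partly updated when it is read, so it is a heuristic rather than a guarantee.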
