Selenium in Python: run scraping code after all lazy-loaded components have loaded



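I'm new to Selenium and still have the following problem after searching for a solution.

I am trying to access all the links on this website (https://www.ecb.europa.eu/press/pressconf/html/index.en.html).

The individual links are loaded in a "lazy-load" fashion: they load gradually as the user scrolls down the screen.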

import re
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("chromedriver.exe")
driver.get("https://www.ecb.europa.eu/press/pressconf/html/index.en.html")

# Scroll to the bottom repeatedly until the page height stops growing
lastHeight = driver.execute_script("return document.body.scrollHeight")
#print(lastHeight)

pause = 0.5
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause)
    newHeight = driver.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:
        break
    lastHeight = newHeight
    print(lastHeight)

# ---

# Collect every link and keep those that look like press conference pages
elems = driver.find_elements(By.XPATH, "//a[@href]")
for elem in elems:
    url = elem.get_attribute("href")
    if re.search(r'is\d+\.en\.html', url):
        print(url)

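However, it only gets the required links from the last lazy-loaded element. Everything before it is missed because it has not been loaded.

I want to make sure that all lazy-loaded elements have loaded before executing any scraping code. How can I do this?

Many thanks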

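Selenium was not designed for web scraping (although it can be useful in complicated cases). In your case, press F12 -> Network and watch the XHR tab while you scroll down the page. You can see that the requests being added contain the year in their URL. So as you scroll down and reach other years, the page issues further requests, one per year.

Look at the Response tab to find the divs and classes, and build the BeautifulSoup "find_all" calls from them. A simple little loop over the years with requests and bs is enough: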

import requests as rq
from bs4 import BeautifulSoup as bs

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0"}
resultats = []
for year in range(1998, 2021+1, 1):
    # Each year's entries are served as a separate HTML fragment
    url = "https://www.ecb.europa.eu/press/pressconf/%s/html/index_include.en.html" % year
    resp = rq.get(url, headers=headers)
    soup = bs(resp.content, "html.parser")
    titles = map(lambda x: x.text, soup.find_all("div", {"class": "title"}))
    subtitles = map(lambda x: x.text, soup.find_all("div", {"class": "subtitle"}))
    dates = map(lambda x: x.text, soup.find_all("dt"))
    # Combine date, title and subtitle by position
    zipped = list(zip(dates, titles, subtitles))
    resultats.extend(zipped)

The result contains:

...
('8 November 2012',
'Mario Draghi, Vítor Constâncio:\xa0Introductory statement to the press conference (with Q&A)',
'Mario Draghi,  President of the ECB,  Vítor Constâncio,  Vice-President of the ECB,  Frankfurt am Main,  8 November 2012'),
('4 October 2012',
'Mario Draghi, Vítor Constâncio:\xa0Introductory statement to the press conference (with Q&A)',
'Mario Draghi,  President of the ECB,  Vítor Constâncio,  Vice-President of the ECB,  Brdo pri Kranju,  4 October 2012'),
...
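
Since the question asks for the links rather than the titles, the same per-year fragments could also be scanned for hrefs. The following is only a minimal sketch under some assumptions: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies, the names HrefCollector and extract_links are made up for this example, and the sample fragment is invented for illustration (the real markup may differ). In practice fragment_html would be resp.text from the loop above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Pattern from the question, with the dots escaped.
LINK_PATTERN = re.compile(r"is\d+\.en\.html")


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(fragment_html, base_url):
    """Return absolute URLs found in the fragment that match LINK_PATTERN."""
    parser = HrefCollector()
    parser.feed(fragment_html)
    parser.close()
    # Relative hrefs are resolved against the URL the fragment came from.
    absolute = (urljoin(base_url, href) for href in parser.hrefs)
    return [url for url in absolute if LINK_PATTERN.search(url)]


# Hypothetical fragment, invented for illustration only.
sample = """
<dt>8 November 2012</dt>
<dd><div class="title"><a href="/press/pressconf/2012/html/is121108.en.html">Introductory statement</a></div></dd>
<dd><a href="/press/html/index.en.html">Other page</a></dd>
"""
base = "https://www.ecb.europa.eu/press/pressconf/2012/html/index_include.en.html"
print(extract_links(sample, base))
```

Calling extract_links once per year inside the loop and extending a single list with the results should give the full set of links without any scrolling, provided the fragments really contain the anchors.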
