Why does Beautiful Soup not find some links?



I am using Beautiful Soup (bs4) to find all the links to PDF files on a given web page.

Here is the code I took from GeeksForGeeks:

hermes_url = "https://finance.hermes.com/en/publications?type=19"
# Import libraries 
import requests 
from bs4 import BeautifulSoup 
# URL from which pdfs to be downloaded 
url = hermes_url
# Requests URL and get response object 
response = requests.get(url) 
# Parse text obtained 
soup = BeautifulSoup(response.text, 'html.parser') 

#result = soup.find_all('div', {'class': 'document'})
# Find all hyperlinks present on webpage 
links = soup.find_all('a') 
i = 0
# Print each link found, to see what the parser actually picked up
for link in links:
    print(link)

# From all links check for pdf link and
# if present download file
for link in links:
    if '.pdf' in link.get('href', []):
        i += 1
        print("Downloading file: ", i)
        # Get response object for link
        response = requests.get(link.get('href'))
        # Write content in pdf file
        pdf = open("pdf" + str(i) + ".pdf", 'wb')
        pdf.write(response.content)
        pdf.close()
        print("File ", i, " downloaded")
print("All PDF files downloaded")

The problem is that, as printing the found links shows, only the "static" parts of the page (the navigation sections at the top and bottom) are picked up, while none of the links in the main section (where the PDF files actually are) get analysed, so I end up downloading no PDFs at all. Does anyone know how I can change this and reach all the links on the page?
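A quick way to confirm that the missing links are rendered by JavaScript (and therefore never appear in the HTML that requests receives) is to search the raw response text directly. This is a minimal diagnostic sketch under that assumption:

import requests

url = "https://finance.hermes.com/en/publications?type=19"
html = requests.get(url).text

# If the document links were part of the static HTML, this would print True.
# On a JavaScript-rendered page it typically prints False, so html.parser
# has nothing to find and a browser-driven tool is needed instead.
print('.pdf' in html)
print(html.count('<a'))  # number of anchor tags present in the raw HTML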

from selenium import webdriver
from bs4 import BeautifulSoup as Bs
import time

browser = webdriver.Chrome()
url = 'https://finance.hermes.com/en/publications/'
browser.get(url)
# Let the JavaScript on the page render the document list before reading it
time.sleep(10)
# Parse the rendered page source rather than the raw HTTP response
soup = Bs(browser.page_source, 'html.parser')
li = soup.find_all('a', {'class': 'document'})
print([i['href'] for i in li if i['href'].endswith('pdf')])

You may need to pass the driver path, e.g. webdriver.Chrome('path'). time.sleep lets you wait for the page to load; after the wait we hand the rendered page source to BeautifulSoup. From that source we extract all a tags with the document class. That returns every such a tag, but we only care about PDFs, so the final list comprehension keeps just the href attributes that end with .pdf. This only picks up the first 9 PDF files; for the rest you need Selenium to click the "Load more" button.
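The answer stops at that first batch, so here is a hedged sketch of how the "Load more" clicking and the downloads could look. The button selector ('button.load-more') is an assumption rather than something taken from the real Hermes page, and the hrefs are assumed to be absolute URLs, so both would need checking in the browser's dev tools:

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as Bs
import requests
import time

browser = webdriver.Chrome()
browser.get('https://finance.hermes.com/en/publications/')
time.sleep(10)

# Keep clicking "Load more" until the button disappears.
# NOTE: 'button.load-more' is an assumed selector; inspect the actual page
# to find the real class or text of the button.
while True:
    buttons = browser.find_elements(By.CSS_SELECTOR, 'button.load-more')
    if not buttons:
        break
    buttons[0].click()
    time.sleep(3)  # let the next batch of documents render

soup = Bs(browser.page_source, 'html.parser')
# Assumes the hrefs are absolute URLs, as the answer's list comprehension implies
links = [a['href'] for a in soup.find_all('a', {'class': 'document'})
         if a.get('href', '').endswith('pdf')]

# Download each PDF with requests, numbering the files as in the question
for i, href in enumerate(links, start=1):
    print("Downloading file:", i)
    data = requests.get(href).content
    with open("pdf" + str(i) + ".pdf", 'wb') as f:
        f.write(data)
print("All PDF files downloaded")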
