I am using Beautiful Soup (bs4) to find all the links to PDF files on a given website's page.
Here is the code I got from GeeksForGeeks:
hermes_url = "https://finance.hermes.com/en/publications?type=19"

# Import libraries
import requests
from bs4 import BeautifulSoup

# URL from which PDFs are to be downloaded
url = hermes_url

# Request the URL and get the response object
response = requests.get(url)

# Parse the text obtained
soup = BeautifulSoup(response.text, 'html.parser')
# result = soup.find_all('div', {'class': 'document'})

# Find all hyperlinks present on the webpage
links = soup.find_all('a')
i = 0

# Print every link found
for link in links:
    print(link)

# From all links, check for a PDF link and
# if present download the file
for link in links:
    if '.pdf' in link.get('href', []):
        i += 1
        print("Downloading file: ", i)
        # Get response object for link
        response = requests.get(link.get('href'))
        # Write content into a PDF file
        pdf = open("pdf" + str(i) + ".pdf", 'wb')
        pdf.write(response.content)
        pdf.close()
        print("File ", i, " downloaded")

print("All PDF files downloaded")
The problem is that, as we can see by printing the links found, only the "static" parts of the page (the navigation sections at the top and bottom) are taken into account, while none of the links in the main section (where the PDF files are) get parsed, which means I end up downloading no PDFs. Does anyone know how I can change this and access all the links on the page?
from selenium import webdriver
from bs4 import BeautifulSoup as Bs
import time

browser = webdriver.Chrome()
url = 'https://finance.hermes.com/en/publications/'
browser.get(url)
time.sleep(10)  # give the JavaScript-rendered content time to load

soup = Bs(browser.page_source, 'html.parser')
li = soup.find_all('a', {'class': 'document'})
[i['href'] for i in li if i['href'].endswith('pdf')]
You may need to specify webdriver.Chrome('path'). time.sleep lets you wait for the page to load. After waiting the specified time, we get the page source with Beautiful Soup. From the source we can extract all the a tags with the document class. That returns all such a tags, but we are only interested in .pdf, so the final list comprehension just keeps the href attributes of the a tags that end with pdf. This only loads 9 PDF files; for the rest you need to have Selenium click load more.
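One thing to watch when you then download the hrefs collected above: they may be relative, so they need to be resolved against the page URL first. A minimal sketch under that assumption (download_pdfs and its file-naming scheme are my own; I use the stdlib urllib here instead of requests so the snippet has no extra dependency):

```python
from urllib.parse import urljoin
from urllib.request import urlopen

def download_pdfs(base_url, hrefs, prefix='pdf'):
    """Resolve each href against base_url and save the body as prefixN.pdf."""
    for n, href in enumerate(hrefs, start=1):
        full_url = urljoin(base_url, href)  # relative hrefs become absolute
        with urlopen(full_url) as resp, open(f'{prefix}{n}.pdf', 'wb') as f:
            f.write(resp.read())

# urljoin leaves absolute URLs untouched and resolves relative ones:
base = 'https://finance.hermes.com/en/publications/'
print(urljoin(base, '/files/report.pdf'))
# https://finance.hermes.com/files/report.pdf
```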