Download all files from a website

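I need to download all the files under this link; the only thing that changes from one link to the next is the suburb name.

For reference, here is one of them: https://www.data.vic.gov.au/data/dataset/2014-town-and-community-profile-for-thornbury-suburb

All the files are listed under this search link: https://www.data.vic.gov.au/data/dataset?q=2014+town+and+community+profile

Is there any way to do this?

Thanks :)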

You can download a file like this:

# Python 3: urllib2 was merged into urllib.request
from urllib.request import urlopen
response = urlopen('http://www.example.com/file_to_download')
data = response.read()  # raw bytes of the downloaded file
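The snippet above only reads the response into memory. A minimal sketch of writing it to disk could look like the following, assuming Python 3 and the standard library only; `filename_from_url`, `download_to_disk`, and the `download.bin` fallback name are hypothetical helpers introduced for illustration, not part of the original answer.

```python
import os
import shutil
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_from_url(url, default="download.bin"):
    """Derive a local file name from the last path segment of the URL."""
    name = os.path.basename(urlparse(url).path)
    # Fall back to a placeholder name when the URL path ends with "/"
    return name or default


def download_to_disk(url, dest_dir="."):
    """Stream the response body into dest_dir and return the saved path."""
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, filename_from_url(url))
    with urlopen(url) as response, open(path, "wb") as out:
        # copyfileobj copies in chunks, so large files are not held in memory
        shutil.copyfileobj(response, out)
    return path


# Usage (placeholder URL taken from the answer above):
# download_to_disk("http://www.example.com/file_to_download", "downloads")
```

Streaming with `shutil.copyfileobj` is a design choice rather than a requirement; for small files, `out.write(response.read())` would work just as well.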

To get all the links in a page:

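from bs4 import BeautifulSoup
import requests

r = requests.get("http://site-to.crawl")
data = r.text
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, "html.parser")
# Print the target of every <a> tag on the page
for link in soup.find_all('a'):
    print(link.get('href'))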

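First fetch the page's HTML, parse it with Beautiful Soup, and then pick out the links that match the type of file you want to download. For example, to download all PDF files, check whether each link ends with the .pdf extension.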
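The filtering step described above can be sketched roughly as follows. To keep the example runnable without third-party packages, it swaps in the standard library's `html.parser` in place of Beautiful Soup; `LinkCollector`, `file_links`, and the sample HTML are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (a stand-in for soup.find_all('a'))."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def file_links(html, base_url, extension=".pdf"):
    """Return absolute URLs of links whose path ends with the extension."""
    collector = LinkCollector()
    collector.feed(html)
    urls = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        # Compare against the URL path so query strings do not interfere
        if urlparse(absolute).path.lower().endswith(extension.lower()):
            urls.append(absolute)
    return urls


# Made-up sample page; in practice `html` would be the fetched page text.
sample = '<a href="/files/a.pdf">A</a> <a href="b.html">B</a> <a href="c.PDF?v=2">C</a>'
print(file_links(sample, "http://www.example.com/data/"))
```

Each URL returned by `file_links` could then be fetched with `urlopen`, as in the first answer. Whether the pages in the question link directly to files with a recognisable extension is an assumption that would need checking against the actual site.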

There is a good explanation with code here:

https://medium.com/@dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
