I am trying to use Python to download 500+ CSV files from the following website:
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/search-recherche/lst/results-resultats.cfm?Lang=E&TABID=1&G=1&Geo1=&Code1=&Geo2=&Code2=&GEOCODE=35&type=0#
The problem is that the CSV files are hidden behind several links. For example:
- Initial link:
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/search-recherche/lst/results-resultats.cfm?Lang=E&TABID=1&G=1&Geo1=&Code1=&Geo2=&Code2=&GEOCODE=35&type=0#
- Example sub-link (there is a download button with a down arrow at the top, which has to be pressed to take the user to another link):
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/page.cfm?Lang=E&Geo1=CSD&Code1=3556033&Geo2=PR&Code2=35&SearchText=Abitibi%2070&SearchType=Begins&SearchPR=01&B1=All&GeoLevel=PR&GeoCode=3556033&TABID=1&type=0
- Second sub-link (I am interested in "Option 1: Download the data as displayed in the data table", in CSV format. The CSV button has to be pressed to download the file):
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/page_Download-Telecharger.cfm?Lang=E&Tab=1&Geo1=CSD&Code1=3556033&Geo2=PR&Code2=35&SearchText=Abitibi%2070&SearchType=Begins&SearchPR=01&B1=All&TABID=1&type=0
I am trying to implement a solution similar to the one in a previous post. Thanks for your help!
Try:
import requests
from bs4 import BeautifulSoup

main_link = "https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/search-recherche/lst/results-resultats.cfm?Lang=E&TABID=1&G=1&Geo1=&Code1=&Geo2=&Code2=&GEOCODE=35&type=0"
soup = BeautifulSoup(requests.get(main_link).content, "html.parser")

# Every place in the results list links to its own page.cfm details page.
for a in soup.select('details a[href*="page.cfm"]'):
    link = a["href"]
    # Swap the relative details path for the direct download endpoint,
    # keeping the query string unchanged.
    link = link.replace(
        "../../details/page.cfm",
        "https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/current-actuelle.cfm",
    )
    # Ask the endpoint for the CSV variant of the file.
    link += "&FILETYPE=CSV"
    print(a.get_text(strip=True))
    print(link)
    print()
Prints:
Abitibi 70 (Indian reserve)
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/current-actuelle.cfm?Lang=E&Geo1=CSD&Code1=3556033&Geo2=PR&Code2=35&SearchText=Abitibi%2070&SearchType=Begins&SearchPR=01&B1=All&GeoLevel=PR&GeoCode=3556033&TABID=1&type=0&FILETYPE=CSV
Addington Highlands (Township)
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/current-actuelle.cfm?Lang=E&Geo1=CSD&Code1=3511035&Geo2=PR&Code2=35&SearchText=Addington%20Highlands&SearchType=Begins&SearchPR=01&B1=All&GeoLevel=PR&GeoCode=3511035&TABID=1&type=0&FILETYPE=CSV
Adelaide-Metcalfe (Township)
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/current-actuelle.cfm?Lang=E&Geo1=CSD&Code1=3539047&Geo2=PR&Code2=35&SearchText=Adelaide-Metcalfe&SearchType=Begins&SearchPR=01&B1=All&GeoLevel=PR&GeoCode=3539047&TABID=1&type=0&FILETYPE=CSV
Adjala-Tosorontio (Township)
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/current-actuelle.cfm?Lang=E&Geo1=CSD&Code1=3543003&Geo2=PR&Code2=35&SearchText=Adjala-Tosorontio&SearchType=Begins&SearchPR=01&B1=All&GeoLevel=PR&GeoCode=3543003&TABID=1&type=0&FILETYPE=CSV
Admaston/Bromley (Township)
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/current-actuelle.cfm?Lang=E&Geo1=CSD&Code1=3547043&Geo2=PR&Code2=35&SearchText=Admaston/Bromley&SearchType=Begins&SearchPR=01&B1=All&GeoLevel=PR&GeoCode=3547043&TABID=1&type=0&FILETYPE=CSV
...and so on.
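The links printed above can then be fetched and written to disk. The following is only a minimal sketch, not tested against the live site: the helper names (`to_csv_url`, `csv_filename`, `download_csv`, `download_all`), the `csv` output folder, the file-naming scheme, and the one-second pause between requests are all assumptions made for illustration.

```python
import re
import time
from pathlib import Path
from urllib.parse import parse_qs, urlsplit

DETAILS_PREFIX = "../../details/page.cfm"
DOWNLOAD_PREFIX = (
    "https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/"
    "details/download-telecharger/current-actuelle.cfm"
)


def to_csv_url(href):
    """Turn a relative page.cfm href into the direct CSV download URL."""
    return href.replace(DETAILS_PREFIX, DOWNLOAD_PREFIX) + "&FILETYPE=CSV"


def csv_filename(name, url):
    """Build a file name from the GeoCode query parameter and the place name."""
    geocode = parse_qs(urlsplit(url).query).get("GeoCode", ["unknown"])[0]
    # Replace anything that is not a word character or hyphen (hypothetical scheme).
    safe_name = re.sub(r"[^\w-]+", "_", name).strip("_")
    return f"{geocode}_{safe_name}.csv"


def download_csv(name, url, out_dir="csv"):
    """Fetch one CSV and write it into out_dir; returns the written path."""
    import requests  # imported here so the pure helpers above work without it

    response = requests.get(url, timeout=60)
    response.raise_for_status()  # stop on 4xx/5xx instead of saving an error page
    out_path = Path(out_dir) / csv_filename(name, url)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(response.content)
    return out_path


def download_all(items, out_dir="csv", delay=1.0):
    """Download every (name, url) pair, pausing between requests."""
    paths = []
    for name, url in items:
        paths.append(download_csv(name, url, out_dir))
        time.sleep(delay)  # assumed courtesy pause; adjust as needed
    return paths
```

Inside the loop, collecting `(a.get_text(strip=True), link)` pairs into a list instead of printing them and passing that list to `download_all` should save one file per place. Including the GeoCode in the file name keeps names unique even if two places share a name.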
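The links to all the first-level sub-links are listed as <li> elements. It seems to me that parsing the HTML text of the initial link and storing the first-level sub-links in a list, then using selenium to navigate to them (and download the CSVs), would be the way to go.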
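The first half of that idea, parsing the initial page and storing the sub-links in a list, can be sketched as follows. This is an illustration under assumptions, not a verified scraper: it uses only the standard library's `html.parser` (BeautifulSoup would be shorter), the class and function names are hypothetical, and it assumes the sub-links are `<a>` tags inside `<li>` elements whose `href` contains `page.cfm`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

MAIN_LINK = (
    "https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/"
    "search-recherche/lst/results-resultats.cfm"
    "?Lang=E&TABID=1&G=1&Geo1=&Code1=&Geo2=&Code2=&GEOCODE=35&type=0"
)


class SubLinkCollector(HTMLParser):
    """Collect absolute URLs of page.cfm links found inside <li> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.li_depth = 0  # how many <li> elements are currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a" and self.li_depth > 0:
            href = dict(attrs).get("href") or ""
            if "page.cfm" in href:
                # Resolve "../../details/page.cfm?..." against the results page.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1


def collect_sub_links(html, base_url=MAIN_LINK):
    """Return the list of absolute first-level sub-link URLs found in html."""
    parser = SubLinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Calling `collect_sub_links(requests.get(MAIN_LINK).text)` should yield the list of details pages. Each entry could then be opened with selenium, although a browser may be unnecessary if the CSV endpoint can be requested directly.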