How to web scrape CSV files hidden behind buttons



I am trying to download 500+ CSV files from the following website using Python:

https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/search-recherche/lst/results-resultats.cfm?Lang=E&TABID=1&G=1&Geo1=&Code1=&Geo2=&Code2=&GEOCODE=35&type=0#

The problem is that the CSV files are hidden behind several links. For example:

  1. Initial link:

https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/search-recherche/lst/results-resultats.cfm?Lang=E&TABID=1&G=1&Geo1=&Code1=&Geo2=&Code2=&GEOCODE=35&type=0#

  2. Example sub-link (there is a download button with a down arrow at the top, which must be pressed to take the user to another link):

https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/page.cfm?Lang=E&Geo1=CSD&Code1=3556033&Geo2=PR&Code2=35&SearchText=Abitibi%2070&SearchType=Begins&SearchPR=01&B1=All&GeoLevel=PR&GeoCode=3556033&TABID=1&type=0

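  3. Second sub-link (I am interested in "Option 1: Download data as displayed in the Data table tab", in CSV format; the CSV button must be pressed to download the file):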

https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/page_Download-Telecharger.cfm?Lang=E&Tab=1&Geo1=CSD&Code1=3556033&Geo2=PR&Code2=35&SearchText=Abitibi%2070&SearchType=Begins&SearchPR=01&B1=All&TABID=1&type=0

I am trying to implement a solution similar to the one in a previous post. Thanks for your help!

Try:

import requests
from bs4 import BeautifulSoup

main_link = "https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/search-recherche/lst/results-resultats.cfm?Lang=E&TABID=1&G=1&Geo1=&Code1=&Geo2=&Code2=&GEOCODE=35&type=0"
soup = BeautifulSoup(requests.get(main_link).content, "html.parser")
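
# Select every <a> under a <details> element whose href contains "page.cfm".
for a in soup.select('details a[href*="page.cfm"]'):
    link = a["href"]
    # Swap the relative profile-page path for the direct download endpoint.
    link = link.replace(
        "../../details/page.cfm",
        "https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/current-actuelle.cfm",
    )
    link += "&FILETYPE=CSV"  # ask for the CSV version of the file
    print(a.get_text(strip=True))
    print(link)
    print()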

Prints:

Abitibi 70 (Indian reserve)
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/current-actuelle.cfm?Lang=E&Geo1=CSD&Code1=3556033&Geo2=PR&Code2=35&SearchText=Abitibi%2070&SearchType=Begins&SearchPR=01&B1=All&GeoLevel=PR&GeoCode=3556033&TABID=1&type=0&FILETYPE=CSV
Addington Highlands (Township)
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/current-actuelle.cfm?Lang=E&Geo1=CSD&Code1=3511035&Geo2=PR&Code2=35&SearchText=Addington%20Highlands&SearchType=Begins&SearchPR=01&B1=All&GeoLevel=PR&GeoCode=3511035&TABID=1&type=0&FILETYPE=CSV
Adelaide-Metcalfe (Township)
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/current-actuelle.cfm?Lang=E&Geo1=CSD&Code1=3539047&Geo2=PR&Code2=35&SearchText=Adelaide-Metcalfe&SearchType=Begins&SearchPR=01&B1=All&GeoLevel=PR&GeoCode=3539047&TABID=1&type=0&FILETYPE=CSV
Adjala-Tosorontio (Township)
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/current-actuelle.cfm?Lang=E&Geo1=CSD&Code1=3543003&Geo2=PR&Code2=35&SearchText=Adjala-Tosorontio&SearchType=Begins&SearchPR=01&B1=All&GeoLevel=PR&GeoCode=3543003&TABID=1&type=0&FILETYPE=CSV
Admaston/Bromley (Township)
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/current-actuelle.cfm?Lang=E&Geo1=CSD&Code1=3547043&Geo2=PR&Code2=35&SearchText=Admaston/Bromley&SearchType=Begins&SearchPR=01&B1=All&GeoLevel=PR&GeoCode=3547043&TABID=1&type=0&FILETYPE=CSV
...and so on.
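
The links printed above still have to be fetched and written to disk. A minimal sketch of that step follows, assuming each link returns the CSV body directly for a plain GET request (not verified here). The names `csv_filename`, `fetch_bytes`, `download_csvs` and the `csv_files` output directory are hypothetical, and the standard library's `urllib.request` is used in place of `requests` only to keep the sketch free of dependencies; `requests.get(link).content` would serve the same purpose.

```python
import re
import time
import urllib.request
from pathlib import Path
from urllib.parse import parse_qs, urlparse


def csv_filename(link: str) -> str:
    """Build a file name from the GeoCode query parameter (assumed unique per place)."""
    query = parse_qs(urlparse(link).query)
    geocode = query.get("GeoCode", ["unknown"])[0]
    # Replace anything that is not safe in a file name.
    return re.sub(r"[^0-9A-Za-z_-]", "_", geocode) + ".csv"


def fetch_bytes(url: str) -> bytes:
    """Download a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def download_csvs(links, out_dir="csv_files", pause=1.0, fetch=fetch_bytes):
    """Save each link's body under out_dir and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for link in links:
        target = out / csv_filename(link)
        target.write_bytes(fetch(link))
        saved.append(target)
        time.sleep(pause)  # pause between requests to go easy on the server
    return saved
```

To use it, append each `link` to a list inside the loop instead of printing it, then call `download_csvs(links)`. The `fetch` parameter exists so the download step can be swapped out, for example for a `requests.Session` with custom headers if the server rejects the default client.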

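All the links to the first sub-links are listed as <li> elements. It seems to me that parsing the HTML of the initial link and storing the first sub-links in a list, then using selenium to navigate to them (and download the CSV), would be the way to go.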
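
The first half of that idea, collecting the sub-links into a list, can be sketched with the standard library alone. This is an illustration under assumptions, not the page's confirmed structure: it assumes the sub-links are `<a>` tags nested inside properly closed `<li>` elements and that their `href` contains `page.cfm`. `SubLinkCollector` and `collect_sub_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SubLinkCollector(HTMLParser):
    """Collect absolute URLs of <a> tags inside <li> whose href contains a marker."""

    def __init__(self, base_url, marker="page.cfm"):
        super().__init__()
        self.base_url = base_url
        self.marker = marker
        self.li_depth = 0  # how many <li> elements are currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a" and self.li_depth > 0:
            href = dict(attrs).get("href")
            if href and self.marker in href:
                # Resolve relative paths such as "../../details/page.cfm".
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1


def collect_sub_links(html, base_url):
    """Return the list of matching sub-links found in an HTML string."""
    parser = SubLinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

Passing the HTML text of the initial page and its URL to `collect_sub_links` yields a list of absolute URLs that a browser driver such as selenium could then visit one by one. Resolving with `urljoin` avoids hard-coding the site prefix.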
