Downloading PDFs from CAG



I am trying to download multiple PDFs from the CAG website (link: https://cag.gov.in/en/state-accounts-report?defuat_state_id=64). I am using the following code:

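import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'https://cag.gov.in/en/state-accounts-report?defuat_state_id=64'
folder_location = 'pdfs'  # assumed output folder; not defined in the original snippet
os.makedirs(folder_location, exist_ok=True)

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Print every link whose href ends in .pdf
for link in soup.select("a[href$='.pdf']"):
    print(link)

# Download each linked PDF into folder_location
for link in soup.select("a[href$='.pdf']"):
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)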

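This gives me every PDF on the page, but I only want to download the PDFs under the "Monthly Key Indicators" tab. Please suggest the necessary changes to the code.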

You can narrow the selection down to the links inside that tab. The tab id can be found with

tabId = soup.find(
    lambda t: t.name == 'a' and t.get('href') and
    t.get('href').startswith('#tab') and  # just in case
    'Monthly Key Indicators' == t.get_text(strip=True)
).get('href')

(Or, if it is always the same id, you can simply set tabId = "#tab-360".) Then you can change the select to

soup.select(f"{tabId} a[href$='.pdf']")
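
To see the tab lookup and the scoped select working together without hitting the site, here is a minimal sketch against a made-up HTML fragment. The markup, the file paths, and the helper name find_tab_id are assumptions for illustration only; the real page may be structured differently. Unlike the one-liner above, the helper returns None instead of raising when no tab header matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: tab headers point at their panes via '#tab-...' hrefs.
SAMPLE_HTML = """
<ul>
  <li><a href="#tab-359">Finance Accounts</a></li>
  <li><a href="#tab-360">Monthly Key Indicators</a></li>
</ul>
<div id="tab-359"><a href="/files/finance.pdf">Download</a></div>
<div id="tab-360">
  <a href="/files/mki-april.pdf">Download</a>
  <a href="/files/mki-may.pdf">Download</a>
</div>
"""


def find_tab_id(soup, tab_title):
    """Return the '#...' href of the tab header with this title, or None."""
    header = soup.find(
        lambda t: t.name == 'a'
        and t.get('href', '').startswith('#tab')
        and t.get_text(strip=True) == tab_title
    )
    return header.get('href') if header else None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
tab_id = find_tab_id(soup, 'Monthly Key Indicators')

# The descendant selector keeps only PDF links inside the matching pane.
scoped_hrefs = [a['href'] for a in soup.select(f"{tab_id} a[href$='.pdf']")]
print(tab_id, scoped_hrefs)
```

The link under the other tab (finance.pdf) is left out because the selector only matches descendants of the element whose id equals the tab id.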

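But aren't you then downloading the same file three times for each report? You can change the for-loop to go through only the links that have "Download" as their text: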

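pdfLinks = soup.select(f"{tabId} a[href$='.pdf']")
pdfLinks = [pl for pl in pdfLinks if pl.get_text(strip=True) == 'Download']
for link in pdfLinks:
    # same download step as in the question
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)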
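
If you would rather not rely on the link text alone, you can also de-duplicate by URL before downloading. The following is only a sketch: collect_downloads is a hypothetical helper, and the sample pane, its file paths, and the 'pdfs' folder are invented for the example rather than taken from the CAG page.

```python
import os
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_downloads(soup, tab_id, base_url, folder):
    """Return (absolute_url, local_path) pairs, one per distinct PDF in the tab."""
    seen = set()
    downloads = []
    for link in soup.select(f"{tab_id} a[href$='.pdf']"):
        if link.get_text(strip=True) != 'Download':
            continue  # skip the other links that point at the same file
        pdf_url = urljoin(base_url, link['href'])
        if pdf_url in seen:
            continue  # this file is already queued
        seen.add(pdf_url)
        local_path = os.path.join(folder, pdf_url.split('/')[-1])
        downloads.append((pdf_url, local_path))
    return downloads


# Hypothetical pane in which several links lead to the same PDF.
SAMPLE_HTML = """
<div id="tab-360">
  <a href="/uploads/mki-april.pdf">April</a>
  <a href="/uploads/mki-april.pdf">(2 MB)</a>
  <a href="/uploads/mki-april.pdf">Download</a>
  <a href="/uploads/mki-may.pdf">Download</a>
  <a href="/uploads/mki-may.pdf">Download</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
downloads = collect_downloads(
    soup, '#tab-360', 'https://cag.gov.in/en/state-accounts-report', 'pdfs'
)
for pdf_url, local_path in downloads:
    print(pdf_url, '->', local_path)
```

Each pair can then go through the same requests.get(...).content write used in the question, so every file is fetched only once.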
