Check whether a string has a .pdf extension



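I'm new to scraping and have two problems. First, I need to scrape a specific section of a website that contains anchor tags. I only need the anchors that link to PDFs, together with their titles, but unfortunately the section also contains anchors with normal links.

The second problem is that the output has unwanted line breaks. Both problems come from the same code.

website.html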

<div>
<a href="www.url.com/somethin.pdf">pdf
link</a>
<a href="www.url.com/somethin.pdf">pdf
link</a>
<a href="www.url.com/somethin">normal
link</a>
</div>

scrappy.py

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.privacy.gov.ph/advisories/')
soup = BeautifulSoup(page.content, 'html.parser')
section = soup.find("section", {"class": "news_content"})
for link in section.find_all("a"):
    pdf = link['href'].replace("..", "")
    title = link.text.strip()
    print("title: " + title + "\t")
    print("pdf_link: " + pdf + "\t")
    print('\n')

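If you run this code, you will see that the titles contain unwanted newlines for that HTML.

Some of the titles in your case contain \n inside their text. You should try the following: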

title = link.text.strip().replace('\n', '')
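One thing to watch when deleting newlines outright: the words on either side of the break get glued together, so the "pdf" / "link" text in the sample HTML would come out as "pdflink". A minimal sketch of an alternative, assuming you would rather collapse each run of whitespace into a single space (the helper name clean_title is made up for illustration):

```python
def clean_title(raw: str) -> str:
    """Collapse every run of whitespace (newlines, tabs, spaces) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty
    # pieces, so leading and trailing whitespace disappears as well.
    return " ".join(raw.split())


print(clean_title("pdf\nlink"))                 # -> pdf link
print("pdf\nlink".strip().replace("\n", ""))    # -> pdflink
```

Inside the loop this would be used as title = clean_title(link.text).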

So the final code with the .pdf filter looks like this:

section = soup.find("section", {"class": "news_content"})
for link in section.find_all("a"):
    pdf = link['href'].replace("..", "")
    # Skip anchors that do not point to a PDF file
    if not pdf.endswith('.pdf'):
        continue
    title = link.text.strip().replace('\n', '')
    print("title: " + title + "\t")
    print("pdf_link: " + pdf + "\t")
    print('\n')
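A plain endswith('.pdf') check is case-sensitive and looks at the whole href, so it would miss links such as file.PDF or file.pdf?download=1. If the site might contain those, here is a rough sketch that checks only the path part of the URL. The helper name is_pdf_link and the example.com URLs are assumptions for illustration, not something from the page being scraped:

```python
from urllib.parse import urlsplit


def is_pdf_link(href: str) -> bool:
    """Return True when the path part of href ends in .pdf (any letter case)."""
    # urlsplit separates the query string and fragment from the path,
    # so "?download=1" or "#page=2" cannot hide the extension.
    path = urlsplit(href).path
    return path.lower().endswith(".pdf")


for href in ["www.url.com/somethin.pdf",
             "www.url.com/somethin",
             "https://example.com/a.PDF?download=1"]:
    print(href, is_pdf_link(href))
```

In the loop above, the filter line would then read: if not is_pdf_link(pdf): continue.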

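You can use a regular expression to select only the hrefs that end with the pdf extension. As for the unwanted line breaks, I'm not sure what you mean. I can only assume you mean that there are two blank lines between each entry. If that assumption is correct, it is because every print call already ends its output with a newline. So print('\n') prints your newline and then the one print adds by default. If you only want a single blank line between entries, remove the last print call and change the \t to \n.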

import re

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.privacy.gov.ph/advisories/')
soup = BeautifulSoup(page.content, 'html.parser')
section = soup.find("section", {"class": "news_content"})
# The dot is escaped so that it matches a literal "." and not any character
links = section.find_all(href=re.compile(r"\.pdf$"))  # <---- SEE HERE
for link in links:
    pdf = link['href'].replace("..", "")
    title = link.text.strip().replace('\n', '')
    print("title: " + title)
    print("pdf_link: " + pdf + "\n")

Output:

title: Updated Templates on Security Incident and Personal Data Breach Reportorial Requirements 
pdf_link: https://www.privacy.gov.ph/wp-content/files/attachments/nwsltr/Final_Advisory18-02_6.26.18.pdf        
title: Guidelines on Privacy Impact Assessments   
pdf_link: https://www.privacy.gov.ph/wp-content/files/attachments/nwsltr/NPC_AdvisoryNo.2017-03.pdf     
title: Access to Personal Data Sheets of Government Personnel 
pdf_link: https://www.privacy.gov.ph/wp-content/files/attachments/nwsltr/NPC_Advisory_No.2017-02.pdf  
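A note on the pattern itself: in a regular expression an unescaped . matches any character, so a pattern written as .pdf$ would also accept an href that merely ends in something like "apdf". BeautifulSoup filters attributes against a compiled pattern with its search() method, so the difference can be checked on plain strings. A small sketch, where the sample hrefs are invented and re.IGNORECASE is an optional extra in case the site uses upper-case extensions:

```python
import re

loose = re.compile(".pdf$")                      # "." matches any character
strict = re.compile(r"\.pdf$", re.IGNORECASE)    # literal dot, any letter case

hrefs = [
    "files/report.pdf",
    "files/REPORT.PDF",
    "files/notapdf",
    "files/report.pdf.html",
]
for href in hrefs:
    # Columns: href, matched by loose pattern, matched by strict pattern
    print(href, bool(loose.search(href)), bool(strict.search(href)))
```

Only the strict pattern rejects "files/notapdf" while still accepting the upper-case link.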
