Is it possible to use Python 3 to access links on a website by their text?

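I am trying to access the first two links under "Certified Lists" on this website: https://dph.georgia.gov/wastewater-management

The dates in the URLs change whenever they add a new list.

So I just want to be able to navigate to these two links based on their text, "Septic Tank Installers" and "Septic Tank Pumpers".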

I don't want anyone to write the code for me. I just can't find anything online that tells me which module to use.

Thanks for any and all help.

For example, this is what I use to fetch a file from a fixed URL:

import requests
dls = 'https://www.sanantonio.gov/DevServ/CrystalReports/BldgActHDMonticelloPrk.xls'
resp = requests.get(dls)

This can be done with the BeautifulSoup library. If you haven't installed it yet, you can do so with

pip install beautifulsoup4

or

python -m pip install beautifulsoup4

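Back to the question. You can use BeautifulSoup to find the p tag that follows the h3 tag containing the text "Certified Lists", and then take the first two links inside it.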

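import requests
from bs4 import BeautifulSoup

# Download the page and parse its HTML
resp = requests.get('https://dph.georgia.gov/wastewater-management')
soup = BeautifulSoup(resp.text, 'html.parser')

# Find the <h3> whose text is "Certified Lists", then the first <p> after it
# (string= is the current name of the older text= argument)
h3_next_p = soup.find('h3', string='Certified Lists').find_next('p')

# Print the href of the first two links in that paragraph
for link in h3_next_p.find_all('a')[:2]:
    print(link.get('href'))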

Output:

/sites/dph.georgia.gov/files/EnvHealth/Sewage/Contractors/EnvHealthInstallers2019-04-09.pdf
/sites/dph.georgia.gov/files/EnvHealth/Sewage/Contractors/EnvHealthPumpers2019-04-09.pdf

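This returns the href exactly as it appears in the page source, which is a path relative to the site root. To get a full URL you can actually request, prepend the domain inside the loop (the href already starts with a slash, so the domain should not end with one):

print('https://dph.georgia.gov' + link.get('href'))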
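
Since the question asks about finding the links by their text rather than by position, here is a rough sketch of that variant. It is only an illustration under some assumptions: the link texts are taken to be "Septic Tank Installers" and "Septic Tank Pumpers", `SAMPLE_HTML` is a made-up stand-in for the real page so the snippet runs offline, and `links_by_text` is a hypothetical helper name. It uses `urljoin` from the standard library to build the full URL instead of concatenating strings.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://dph.georgia.gov/wastewater-management"

# Hypothetical stand-in for the page's HTML; the real markup may differ.
SAMPLE_HTML = """
<h3>Certified Lists</h3>
<p>
  <a href="/sites/dph.georgia.gov/files/EnvHealth/Sewage/Contractors/EnvHealthInstallers2019-04-09.pdf">Septic Tank Installers</a>
  <a href="/sites/dph.georgia.gov/files/EnvHealth/Sewage/Contractors/EnvHealthPumpers2019-04-09.pdf">Septic Tank Pumpers</a>
  <a href="/other.pdf">Something Else</a>
</p>
"""


def links_by_text(html, base_url, wanted_texts):
    """Map each wanted link text to an absolute URL, or None if not found."""
    soup = BeautifulSoup(html, "html.parser")
    found = {}
    for text in wanted_texts:
        # Match the visible link text, ignoring case and surrounding whitespace.
        pattern = re.compile(r"^\s*" + re.escape(text) + r"\s*$", re.IGNORECASE)
        link = soup.find("a", string=pattern)
        if link is not None and link.has_attr("href"):
            # urljoin resolves the root-relative href against the page URL.
            found[text] = urljoin(base_url, link["href"])
        else:
            found[text] = None
    return found


result = links_by_text(
    SAMPLE_HTML, BASE_URL, ["Septic Tank Installers", "Septic Tank Pumpers"]
)
for text, url in result.items():
    print(text, "->", url)
```

To run this against the live site, you would pass `requests.get(BASE_URL).text` in place of `SAMPLE_HTML`. Matching on text keeps working if the order of the links changes, but it will return None if the site rewords the link text, so checking for None before downloading is probably worthwhile.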
