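I want to extract the unique links from this web page. My code runs without errors, but the result is wrong: it does not pick up enough links. The page should have 117 unique links, but the code returns only 90. Can someone help me work out what is wrong with my code? Thanks!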
import urllib.request
from bs4 import BeautifulSoup
url = "https://www.census.gov/programs-surveys/popest.html"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
# Collect every <a> tag that carries an href attribute.
tags = soup.find_all('a', {"href": True})
# The set comprehension keeps only distinct href values.
b = {tag.get('href') for tag in tags}
for c in b:
    print(c)
It seems to work for me. Maybe try selecting the links a different way, like this:
import urllib.request
from bs4 import BeautifulSoup
if __name__ == '__main__':
    url = "https://www.census.gov/programs-surveys/popest.html"
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    # Select only anchors whose href starts with "http" (absolute links).
    links = [a['href'] for a in soup.select('a[href^="http"]')]
    unique_links = set(links)
    print(len(links))         # total number of matching links
    print(len(unique_links))  # number of distinct links
Output:
219
90
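One thing that might explain differing counts (this is a guess, not something confirmed for this page) is how the hrefs are compared. The selector `a[href^="http"]` skips relative links entirely, and raw href strings that differ only by a `#fragment` or by being relative versus absolute count as different set members. Below is a minimal sketch of normalizing hrefs before deduplicating. The helper name `unique_absolute_links` and the sample hrefs are made up for illustration; only the standard-library functions `urljoin` and `urldefrag` are real.

```python
from urllib.parse import urljoin, urldefrag


def unique_absolute_links(hrefs, base_url):
    """Resolve each href against base_url, drop fragments, and deduplicate.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    unique = set()
    for href in hrefs:
        href = href.strip()
        # Skip empty values, same-page anchors, and non-navigational schemes.
        if not href or href.startswith(("#", "javascript:", "mailto:", "tel:")):
            continue
        # Turn relative links into absolute ones.
        absolute = urljoin(base_url, href)
        # Remove any trailing "#fragment" so the same page is counted once.
        absolute, _fragment = urldefrag(absolute)
        unique.add(absolute)
    return unique


# Made-up sample hrefs, standing in for values pulled from the page.
base = "https://www.census.gov/programs-surveys/popest.html"
sample_hrefs = [
    "/programs-surveys/popest.html",
    "https://www.census.gov/programs-surveys/popest.html#content",
    "popest/data.html",
    "#top",
    "mailto:someone@example.com",
]
links = unique_absolute_links(sample_hrefs, base)
print(len(links))
for link in sorted(links):
    print(link)
```

With the real page, you could pass in `(a["href"] for a in soup.find_all("a", href=True))` together with `url`. Whether that gets you to 117 depends on how that number was counted in the first place. If some links are added by JavaScript after the page loads, they will not be in the HTML that `urllib` downloads at all.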