Print only the HTML lines containing "<a href>" inside a div, using BeautifulSoup



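I'm using BeautifulSoup to open a URL, find the div with the id "audience-container", and then print only the lines that contain "a href". I have the first two parts done (I think), but I can't figure out how to extract just the "a href" lines from that section.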

import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("httm://www.champlain.edu/current-students")
bs = BeautifulSoup(html.read(), "html parser")
for link in bs.find('div', {'id': 'audience-container'}):
    print(link)  # this prints the full section under audience-container, but not what I want
    # print statement to pull out ONLY 'a href' that I keep messing up

Try this:

import requests
from bs4 import BeautifulSoup

main_url = "https://www.champlain.edu"
bs = BeautifulSoup(requests.get(f"{main_url}/current-students").text, "html.parser")
for link in bs.find('div', {"id": "audience-nav"}).find_all("a"):
    print(f"{main_url}/{link.get('href')}")

Output:

https://www.champlain.edu/admitted-students
https://www.champlain.edu/current-students
https://www.champlain.edu/prospective-students
https://www.champlain.edu/undergrad-applicants
https://www.champlain.edu/online
https://www.champlain.edu/alumni
https://www.champlain.edu/parents
https://www.champlain.edu/faculty-and-staff
https://www.champlain.edu/school-counselors
https://www.champlain.edu/employer-resources
https://www.champlain.edu/prospective-employees
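
The f-string above assumes every href is a bare relative path. If a page mixes absolute, root-relative, and relative links, `urllib.parse.urljoin` from the standard library can resolve each one against the page URL. A minimal sketch: the helper name `absolutize` and the href values in `sample` are made up for illustration, not scraped from the site.

```python
from urllib.parse import urljoin

# Page the links were found on; relative hrefs are resolved against it.
BASE_URL = "https://www.champlain.edu/current-students"


def absolutize(hrefs, base=BASE_URL):
    """Resolve each href (relative or absolute) against the base page URL."""
    return [urljoin(base, href) for href in hrefs]


# Hypothetical href values covering the three common shapes.
sample = ["/alumni", "parents", "https://www.champlain.edu/online"]
for url in absolutize(sample):
    print(url)
```

In the scraping loop, this would amount to printing `urljoin(page_url, link.get('href'))` instead of concatenating strings by hand.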

Try this:

from bs4 import BeautifulSoup
import requests
url = "http://www.champlain.edu/current-students"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, 'lxml')
for link in soup.find_all('a'):
    print(link.get('href'))

There is a mistake in your code: httm instead of http. I hope this helps!
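
Note that `find_all('a')` on the whole document returns every link on the page. To keep only the links inside one div, you can look up the div first and search within it. Here is a small sketch that runs on an inline HTML snippet rather than the live site: the markup in `SAMPLE_HTML` and the helper name `links_in_div` are assumptions for illustration, and the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page.
SAMPLE_HTML = """
<div id="audience-nav">
  <a href="/alumni">Alumni</a>
  <a href="/parents">Parents</a>
  <a name="no-href">Anchor without href</a>
</div>
<div id="footer"><a href="/contact">Contact</a></div>
"""


def links_in_div(html, div_id):
    """Return the href of every <a> inside the div with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id=div_id)
    if container is None:
        # find() returns None when nothing matches, so guard before chaining.
        return []
    # href=True skips <a> tags that have no href attribute.
    return [a["href"] for a in container.find_all("a", href=True)]


print(links_in_div(SAMPLE_HTML, "audience-nav"))
```

The `None` check matters because chaining `.find_all()` directly onto a failed `find()` raises an `AttributeError`, which is what happens if the div id is misspelled or the page layout changes.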
