Beautiful Soup HTML parsing breaks <link> tags



I am using Beautiful Soup to parse the HTML of an RSS page. How can I keep the <link> tags intact?

The most promising code so far is:

import urllib.request, urllib.parse, urllib.error 
from bs4 import BeautifulSoup
url = 'https://advisories.ncsc.nl/rss/advisories'
uh = urllib.request.urlopen(url)
html_doc = uh.read()
soup = BeautifulSoup(html_doc, 'html.parser')

I tried importing lxml and switching the code to soup = BeautifulSoup(html_doc, 'xml'), but that gives me an error:

ModuleNotFoundError: No module named 'lxml'

I expect the result to be <link>https://someurl.org</link>, but the output is <link/>someurl.org.
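The root cause is that in HTML, <link> is a void element (like <br>), so an HTML parser closes it immediately and the URL ends up as loose text after an empty tag. Parsed as XML, the same text is kept as ordinary element content. A minimal sketch of that difference using only the standard library, with a small inline snippet standing in for the real feed:

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the RSS feed. An HTML parser would turn
# <link>...</link> into an empty <link/> followed by stray text;
# an XML parser treats <link> as a normal element.
rss_snippet = """<rss><channel><item>
  <link>https://someurl.org</link>
</item></channel></rss>"""

root = ET.fromstring(rss_snippet)
# The URL survives as the element's text when parsed as XML.
print(root.find('./channel/item/link').text)
```

This prints https://someurl.org, which is the behavior you are after.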

You are trying to parse an RSS feed; for that you can use a feed parser, e.g.:

import feedparser, requests
feed_xml = requests.get("https://advisories.ncsc.nl/rss/advisories").text
feed = feedparser.parse(feed_xml)
print('Number of RSS posts :', len(feed.entries))
for entry in feed.entries:
    print(entry.title)
    print(entry.link)
    print(entry.description)

Output:

Number of RSS posts : 25
NCSC-2019-0098 [1.02] [H/M] Kwetsbaarheid verholpen in libreoffice
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0098
Een kwaadwillende kan de kwetsbaarheid mogelijk misbruiken om willekeurige code uit te voeren onder de rechten van een gebruiker.
...

Install feedparser with pip:

pip install feedparser

Changing the parser to xml fixes the <link> tags. Note that BeautifulSoup's 'xml' parser requires lxml, so install it first with pip install lxml (that is the cause of your ModuleNotFoundError):

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
url = 'https://advisories.ncsc.nl/rss/advisories'
uh = urllib.request.urlopen(url)
html_doc = uh.read()
soup = BeautifulSoup(html_doc, 'xml')    # <-- changing to 'xml'
for link in soup.select('link'):
    print(link.get_text(strip=True))

Prints:

https://advisories.ncsc.nl/rss/advisories
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0098
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0584
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0511
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0583
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0560
https://advisories.ncsc.nl/advisory?id=NCSC-2019-0546
...and so on.
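If installing lxml is the blocker, the same extraction can be sketched with the standard library alone; the helper name extract_links below is made up for illustration, and the live fetch is left commented out:

```python
import urllib.request
import xml.etree.ElementTree as ET

def extract_links(feed_xml):
    """Return the text of every <link> element in an RSS document."""
    root = ET.fromstring(feed_xml)
    return [link.text for link in root.iter('link')]

# Live feed from the question (requires network access):
# with urllib.request.urlopen('https://advisories.ncsc.nl/rss/advisories') as uh:
#     for url in extract_links(uh.read()):
#         print(url)

# Small inline sample for demonstration:
sample = ('<rss><channel>'
          '<link>https://advisories.ncsc.nl/rss/advisories</link>'
          '<item><link>https://advisories.ncsc.nl/advisory?id=NCSC-2019-0098</link></item>'
          '</channel></rss>')
for url in extract_links(sample):
    print(url)
```

ET.fromstring accepts either str or bytes, so the raw response body can be passed in directly without decoding.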
