How to extract text from a span tag that has no title



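I'm trying to extract the CVE from this page and a few others. Here is the link: https://www.tenable.com/plugins/nessus/19090. However, the CVE doesn't seem to have a title or anything else I can use to grab the text. Is there a way to do this? Here is what the HTML for the CVE looks like.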

<section>
<h4 class="u-m-t-2">Reference Information</h4>
<section>
<p><strong>CVE
<!-- -->:
</strong><span><a href="/cve/CVE-2004-0804">CVE-2004-0804</a></span></p>
</section>
<section></section>
<div>
<section>
<p><strong>CERT
<!-- -->:
</strong><span><a target="_blank" rel="noopener noreferrer" href="https://www.kb.cert.org/vuls/id/555304">555304</a></span></p>
</section>
</div>
</section>

Edit: here is my code as it currently stands, with Jack Ashton's suggestion applied.

import re
import urllib.error
from urllib.request import urlopen, Request

import bs4 as bs

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}

with open("path to file with id's") as f:
    for line in f:
        active = line.strip()  # drop the trailing newline so the URL is valid
        reg_url = "https://www.tenable.com/plugins/nessus/" + active
        req = Request(url=reg_url, headers=headers)
        try:
            source = urlopen(req).read()
        except urllib.error.HTTPError as e:
            if e.getcode() in (404, 502):  # skip missing pages and bad gateways
                continue
            raise
        soup = bs.BeautifulSoup(source, 'lxml')
        # re.search needs a string, not a BeautifulSoup object
        result = re.search(r"<span>(.*CVE.*)</span>", str(soup))
        if result:  # some pages have no CVE entry
            print(result[0])

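Using Python, here is one way to extract the CVE from that page. I'm not sure what a CVE is or exactly what you want from it, but since you know "CVE" will appear in the tag's href and text, you can search for that tag with a regex. You can adapt it to your needs; this is just a starting point.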

import re
test = """
<section>
<h4 class="u-m-t-2">Reference Information</h4>
<section>
<p><strong>CVE
<!-- -->:
</strong><span><a href="/cve/CVE-2004-0804">CVE-2004-0804</a></span></p>
</section>
<section></section>
<div>
<section>
<p><strong>CERT
<!-- -->:
</strong><span><a target="_blank" rel="noopener noreferrer" href="https://www.kb.cert.org/vuls/id/555304">555304</a></span></p>
</section>
</div>
</section>
"""
result = re.search(r"<span>(.*CVE.*)</span>", test)
print(result[1])  # group 1: <a href="/cve/CVE-2004-0804">CVE-2004-0804</a>
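
The snippet above returns markup rather than the bare identifier. If only the CVE IDs are wanted, a minimal sketch is to match the ID format directly. This assumes IDs follow the usual CVE-YYYY-NNNN shape; `find_cve_ids` and `CVE_PATTERN` are illustrative names, not part of any library. Note that it matches IDs anywhere in the text passed in, not only in the reference section.

```python
import re

# Assumed ID shape: "CVE-", a four-digit year, then four or more digits.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")


def find_cve_ids(html):
    """Return the unique CVE IDs found in html, in order of first appearance."""
    seen = []
    for cve_id in CVE_PATTERN.findall(html):
        if cve_id not in seen:
            seen.append(cve_id)
    return seen


sample = '<span><a href="/cve/CVE-2004-0804">CVE-2004-0804</a></span>'
print(find_cve_ids(sample))  # ['CVE-2004-0804']
```

The ID appears twice in the sample (once in the href, once in the link text), which is why duplicates are dropped.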

import requests
from bs4 import BeautifulSoup
url = 'https://www.tenable.com/plugins/nessus/19090'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
print( soup.select_one('a[href*="/cve/CVE"]').text )

Prints:

CVE-2004-0804

Or:

print( soup.select_one('strong:contains("CVE:") + span').text )

Or:

print( soup.select_one('h4:contains("Reference Information") + * span').text )
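
When looping over many plugin pages, some may have no CVE entry at all. In that case `select_one` returns `None` and `.text` raises an `AttributeError`. A minimal sketch that avoids this by using `select`, which returns an empty list when nothing matches (`extract_cves` is a hypothetical helper name, and the two HTML strings are cut-down samples based on the markup in the question):

```python
from bs4 import BeautifulSoup


def extract_cves(html):
    """Return the text of every link whose href contains /cve/CVE."""
    soup = BeautifulSoup(html, "html.parser")
    # select() yields [] on pages without a CVE link, so nothing raises here.
    return [a.get_text(strip=True) for a in soup.select('a[href*="/cve/CVE"]')]


with_cve = '<p><strong>CVE:</strong><span><a href="/cve/CVE-2004-0804">CVE-2004-0804</a></span></p>'
without_cve = '<p><strong>CERT:</strong><span><a href="https://www.kb.cert.org/vuls/id/555304">555304</a></span></p>'

print(extract_cves(with_cve))     # ['CVE-2004-0804']
print(extract_cves(without_cve))  # []
```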
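
from bs4 import BeautifulSoup
import requests

def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # url[:23] is the site root, "https://www.tenable.com";
    # the attribute value is quoted because it contains slashes
    target = [
        f"{url[:23] + x['href']}" for x in soup.select('a[href^="/cve/CVE-"]')]
    print(target)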

main("https://www.tenable.com/plugins/nessus/19090")

Output:

['https://www.tenable.com/cve/CVE-2004-0804']
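
Slicing the first 23 characters off the URL only works while the site root has exactly that length. As an alternative sketch, the standard library's `urljoin` can resolve the relative href against the page URL (`page_url` and `href` below simply reuse the values from this thread):

```python
from urllib.parse import urljoin

page_url = "https://www.tenable.com/plugins/nessus/19090"
href = "/cve/CVE-2004-0804"

# A leading slash in href replaces the whole path of page_url.
absolute = urljoin(page_url, href)
print(absolute)  # https://www.tenable.com/cve/CVE-2004-0804
```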
