Extract the contents of a div, including tags, with Beautiful Soup 4



Hi, when I do the following:

soup.find('div', id='id1')

I get this:

<div id="id1">
<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>
</div>

I only need this:

<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>

Is there a way to get the above? I tried .contents, but it did not give me what I need.

Thanks

from bs4 import BeautifulSoup
html = """<div id="id1">
<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>
</div>"""
soup = BeautifulSoup(html, 'html.parser')
el = soup.find('div', id='id1')
print(el.decode_contents(formatter="html"))

Output:

<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>
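To make the difference concrete, here is a minimal sketch comparing str(el), which keeps the wrapping div, with el.decode_contents(), which returns only the inner markup. The variable names outer_html and inner_html are just illustrative, and the default formatter is assumed.

```python
from bs4 import BeautifulSoup

html = """<div id="id1">
<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>
</div>"""

soup = BeautifulSoup(html, "html.parser")
el = soup.find("div", id="id1")

# str(el) serializes the tag itself plus everything inside it.
outer_html = str(el)

# decode_contents() serializes only the children, as a str.
inner_html = el.decode_contents()

print(outer_html.startswith('<div id="id1">'))  # wrapper is present
print('id="id1"' in inner_html)                 # wrapper is gone
```

If bytes are needed rather than a str, encode_contents() is the matching method.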

Using contents I got the following:

[u'\n', <p id="ptag"> hello this is "p" tag</p>, u'\n', <span id="spantag"> hello this is "p" tag</span>, u'\n', <div id="divtag"> hello this is "p" tag</div>, u'\n', <h1 id="htag"> hello this is "p" tag</h1>, u'\n']

Iterate over the list and you can easily get the desired output (skipping the \n elements).
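The iteration could look roughly like this. It is a sketch that assumes only Tag children are wanted, so every bare text node (including the "\n" entries) is dropped; the name inner_html is illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<div id="id1">
<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>
</div>"""

soup = BeautifulSoup(html, "html.parser")
el = soup.find("div", id="id1")

# Keep only Tag children; text nodes such as "\n" are skipped.
inner_html = "\n".join(
    str(child) for child in el.contents if isinstance(child, Tag)
)
print(inner_html)
```

Note that filtering on Tag would also drop any meaningful text placed directly inside the outer div, so decode_contents() is probably the safer choice when the div mixes text and tags.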

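Assuming the result of soup.find is stored as a string in a variable (called div_html here, a made-up name), then:

import re
div_html = re.sub(r"^\s*<div[^>]*>|</div>\s*$", "", div_html)

might work.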

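BeautifulSoup has a specific feature that does exactly what you need: unwrap()

Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever is inside that tag. It is good for stripping out markup.

Working example: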

from bs4 import BeautifulSoup

data = """
<div id="id1">
<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
soup.div.unwrap()
print(soup)

This prints:

<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>
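One thing worth keeping in mind is that unwrap() changes the parsed tree in place, and soup.div simply picks the first div in the document. A variation that targets the div by its id, as in the question, might look like this (the names outer and result are illustrative):

```python
from bs4 import BeautifulSoup

data = """<div id="id1">
<p id="ptag"> hello this is "p" tag</p>
<span id="spantag"> hello this is "p" tag</span>
<div id="divtag"> hello this is "p" tag</div>
<h1 id="htag"> hello this is "p" tag</h1>
</div>"""

soup = BeautifulSoup(data, "html.parser")

# Select the wrapper explicitly instead of relying on it being the first div.
outer = soup.find("div", id="id1")

# unwrap() removes the wrapper tag from the tree but keeps its children.
outer.unwrap()

result = str(soup)
print(result)
```

Because the tree is modified, the outer div can no longer be found in soup afterwards. If the original document is still needed, decode_contents() leaves it untouched.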
