How do I convert <br> and <p> into line breaks?



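Suppose I have HTML containing <p> and <br> tags. Afterwards, I am going to strip the HTML to clean out the tags. How can I turn those tags into line breaks?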

I'm using Python's BeautifulSoup library, if that helps.

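Without some specifics, it's hard to be sure this does exactly what you want, but it should give you the idea... it assumes your <br> tags are wrapped inside <p> elements.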

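# BeautifulSoup 3 import (Python 2); with bs4 this is `from bs4 import BeautifulSoup`
from BeautifulSoup import BeautifulSoup
import six
def replace_with_newlines(element):
    # Walk every descendant: keep text nodes, turn each <br> into a newline
    text = ''
    for elem in element.recursiveChildGenerator():
        if isinstance(elem, six.string_types):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text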
page = """<html>
<body>
<p>America,<br>
Now is the<br>time for all good men to come to the aid<br>of their country.</p>
<p>pile on taxpayer debt<br></p>
<p>Now is the<br>time for all good men to come to the aid<br>of their country.</p>
</body>
</html>
"""
soup = BeautifulSoup(page)
lines = soup.find("body")
for line in lines.findAll('p'):
    line = replace_with_newlines(line)
    print line

Running this results in...

(py26_default)[mpenning@Bucksnort ~]$ python thing.py
America,
Now is the
time for all good men to come to the aid
of their country.
pile on taxpayer debt
Now is the
time for all good men to come to the aid
of their country.
(py26_default)[mpenning@Bucksnort ~]$

get_text seems to do what you need:

>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'
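
Note that get_text(separator="\n") puts the separator between every pair of text nodes, not only at <br> and <p>. A minimal sketch of a narrower variant, assuming bs4 is installed: swap each <br> for a newline string, append a newline to each <p>, then call get_text with no separator. The function name text_with_line_breaks and the choice of "html.parser" are illustrative assumptions, not part of the answer above.

```python
from bs4 import BeautifulSoup

def text_with_line_breaks(html):
    # Naming the stdlib-backed parser avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(html, "html.parser")
    # Replace every <br> element with a literal newline text node.
    for br in soup.find_all("br"):
        br.replace_with("\n")
    # Terminate every paragraph with a newline.
    for p in soup.find_all("p"):
        p.append("\n")
    # With the tree edited, the default get_text() concatenates the text nodes as they are.
    return soup.get_text()

print(repr(text_with_line_breaks("<p>Now is the<br>time</p><p>for all</p>")))
```

Unlike the separator approach, inline tags such as <b> do not introduce extra line breaks here.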
This is a Python 3 version of @Mike Pennington's answer (it really helped); I did a little refactoring.

def replace_with_newlines(element):
    text = ''
    for elem in element.recursiveChildGenerator():
        if isinstance(elem, str):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text

def get_plain_text(soup):
    plain_text = ''
    lines = soup.find("body")
    for line in lines.findAll('p'):
        line = replace_with_newlines(line)
        plain_text += line
    return plain_text

To use it, just pass the BeautifulSoup object to the get_plain_text function.

soup = BeautifulSoup(page)
plain_text = get_plain_text(soup)
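
For reference, here is a self-contained sketch of the same idea that runs as-is under Python 3 with bs4. It is a variation, not the code above: it uses .descendants (the current bs4 name for recursiveChildGenerator), checks for NavigableString instead of str, and joins paragraphs with a newline so that consecutive <p> elements do not run together. The names paragraph_text and plain_text are illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

def paragraph_text(p):
    # Collect stripped text nodes; each <br> contributes one newline.
    parts = []
    for node in p.descendants:
        if isinstance(node, NavigableString):
            parts.append(node.strip())
        elif node.name == "br":
            parts.append("\n")
    return "".join(parts)

def plain_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # One output line group per <p>, separated by a newline.
    return "\n".join(paragraph_text(p) for p in soup.find_all("p"))

print(plain_text("<html><body><p>Now is the<br>time</p><p>for all</p></body></html>"))
```

Because every text node is stripped, spaces next to inline tags are lost (for example "a <b>b</b>" becomes "ab"), the same as in the functions above.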

I use the following small library to do this:

https://github.com/TeamHG-Memex/html-text

pip install html-text

It's as simple as:

>>> import html_text
>>> html_text.extract_text('<h1>Hello</h1> world!')
'Hello\n\nworld!'

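I'm not entirely sure what you're trying to accomplish, but if you're just trying to remove HTML elements, I would use a program like Notepad2 and its Replace All feature - I think you can insert new lines with Replace All as well. If you replace the <p> elements, make sure you also remove the closing tags (</p>). Also, just FYI, <br /> is the XHTML-style spelling of <br>; HTML5 accepts both, so it doesn't really matter. Python wouldn't be my first choice for this, so it's a bit outside my area of knowledge. Sorry I can't help more.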
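
The Replace All approach described above can also be sketched in Python with the standard re module. This is an illustration under simplifying assumptions: it only handles <br> and <p>, and regular expressions are brittle on real-world HTML, so a parser is usually safer. The function name tags_to_newlines is hypothetical.

```python
import re

def tags_to_newlines(html):
    # <br>, <br/> and <br /> each become one newline.
    text = re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)
    # A closing </p> ends the paragraph, so it becomes a newline too.
    text = re.sub(r"</p\s*>", "\n", text, flags=re.IGNORECASE)
    # Opening <p> tags, with or without attributes, are removed outright.
    text = re.sub(r"<p(\s[^>]*)?>", "", text, flags=re.IGNORECASE)
    return text

print(repr(tags_to_newlines("<p>Now is the<br />time</p><p class='x'>for all</p>")))
```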
