Stripping whitespace between HTML tags

I have a string that contains some HTML markup, like this:

"<p>   This is a   test   </p>"

I want to remove all the extra whitespace between the tags. I have tried the following:

In [1]: import re
In [2]: val = "<p>   This is a   test   </p>"
In [3]: re.sub(r"\s{2,}", "", val)
Out[3]: '<p>This is atest</p>'
In [4]: re.sub(r"\s\s+", "", val)
Out[4]: '<p>This is atest</p>'
In [5]: re.sub(r"\s+", "", val)
Out[5]: '<p>Thisisatest</p>'

But none of these gives the desired result, which is <p>This is a test</p>

How can I achieve this?

Try using an HTML parser such as BeautifulSoup:

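from bs4 import BeautifulSoup as BS
s = "<p>   This is a   test   </p>"
soup = BS(s, 'html.parser')  # name the parser explicitly to avoid a warning
p = soup.find('p')
p.string = ' '.join(p.text.split())  # collapse whitespace runs and trim the ends
print(soup)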

This prints:

<p>This is a test</p>

Try:

re.sub(r'\s+<', '<', val)
re.sub(r'>\s+', '>', val)

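However, this is too simplistic for general real-world use, where angle brackets and the whitespace around them are not always just markup to be trimmed (think of <code> blocks, <script> blocks, and so on). For input like that, you should use a proper HTML parser.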
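To make the limitation concrete, here is a small sketch that wraps the two substitutions above in a helper (the name `strip_around_tags` is made up for this illustration) and runs it on the paragraph from the question and on a `<pre>` block. The `<pre>` input is an assumed example, chosen because its line break and indentation are significant.

```python
import re

def strip_around_tags(html):
    """Apply the two substitutions: drop whitespace next to < and >."""
    html = re.sub(r'\s+<', '<', html)
    return re.sub(r'>\s+', '>', html)

# Works for the simple paragraph (whitespace inside the text is untouched).
simple = strip_around_tags("<p>   This is a   test   </p>")

# In a <pre> block the newline and indentation before <b> matter,
# but the regex cannot know that and removes them anyway.
pre = strip_around_tags("<pre>if x:\n    <b>return</b></pre>")

print(simple)
print(pre)
```

The second call silently changes how the preformatted text would render, which is the kind of case a parser-based approach can detect and skip.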

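From the question, I see that you are parsing a very specific HTML string. Regular expressions are quick and dirty, but they are not recommended here; use an XML parser instead. Note: XML is stricter than HTML. So if you think your input might not be XML, use BeautifulSoup as @Haidro suggested.

For your case, you could do this: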

>>> import xml.etree.ElementTree as ET
>>> p = ET.fromstring("<p>   This is a   test   </p>")
>>> p.text.strip()
'This is a   test'
>>> p.text = p.text.strip()  # If you want to perform more operations on the string, do them here.
>>> ET.tostring(p, encoding='unicode')  # Python 3: return str rather than bytes
'<p>This is a   test</p>'
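The session above only trims the ends of the text, so the inner run of spaces survives. A possible extension, sketched here under the assumption of Python 3 and well-formed XML input, walks every element and collapses whitespace in both `.text` and `.tail`. The helper names `collapse` and `normalize_xml` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def collapse(text):
    """Collapse whitespace runs to single spaces and trim both ends."""
    return " ".join(text.split()) if text else text

def normalize_xml(markup):
    """Parse well-formed XML and normalize the text around every element."""
    root = ET.fromstring(markup)
    for elem in root.iter():
        elem.text = collapse(elem.text)  # text before the first child
        elem.tail = collapse(elem.tail)  # text after the closing tag
    return ET.tostring(root, encoding="unicode")

print(normalize_xml("<p>   This is a   test   </p>"))
```

One caveat: because every text node is trimmed, spaces next to inline elements such as `<b>` are removed as well, so this sketch suits block-level markup better than running prose.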

This might help:

import re
val = "<p>   This is a   test   </p>"
re_strip_p = re.compile("<p>|</p>")
val = '<p>%s</p>' % re_strip_p.sub('', val).strip()

You could try this:

# Python 3.5+: an unmatched group is replaced by an empty string
re.sub(r'\s+(</)|(<[^/][^>]*>)\s+', r'\1\2', val)

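>>> s = '<p>   This is a   test   </p>'
>>> s = re.sub(r'(\s)(\s*)', r'\g<1>', s)  # collapse each whitespace run to its first character
>>> s
'<p> This is a test </p>'
>>> s = re.sub(r'>\s*', '>', s)  # drop whitespace after a tag
>>> s = re.sub(r'\s*<', '<', s)  # drop whitespace before a tag
>>> s
'<p>This is a test</p>'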
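The three steps can be bundled into one reusable function. This is a minimal sketch, not part of the original answer: the name `squeeze_html` is made up, and it varies the first step slightly by replacing each whitespace run with a plain space rather than with the run's first character.

```python
import re

_WS_RUN = re.compile(r'\s+')      # any run of whitespace
_AFTER_TAG = re.compile(r'>\s+')  # whitespace right after '>'
_BEFORE_TAG = re.compile(r'\s+<') # whitespace right before '<'

def squeeze_html(s):
    """Collapse whitespace runs, then trim whitespace next to tags."""
    s = _WS_RUN.sub(' ', s)
    s = _AFTER_TAG.sub('>', s)
    return _BEFORE_TAG.sub('<', s)

print(squeeze_html('<p>   This is a   test   </p>'))
```

Like the other regex answers, it treats every `<` and `>` as a tag boundary, so it is only suitable for simple markup where whitespace is never significant.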
