BeautifulSoup: don't add html, head and body tags automatically

I'm using beautifulsoup with html5lib, and it adds the html, head and body tags automatically:

BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>

Is there an option I can set to turn this behavior off?

In [35]: import bs4 as bs
In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
Out[36]: <h1>FOO</h1>

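This parses the HTML with Python's built-in HTML parser. Quoting the docs:

Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.

Alternatively, you could use the html5lib parser and just select the element after <body>: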

In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')
In [62]: soup.body.next_element
Out[62]: <h1>FOO</h1>

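Let's first create a soup sample:

# html5lib is named explicitly so the <html> wrapper is always present
soup = BeautifulSoup("<head></head><body><p>content</p></body>", "html5lib")

You can get the children of html and body by specifying soup.body.<tag>: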

# python3: get body's first child
print(next(soup.body.children))
# if first child's tag is rss
print(soup.body.rss)

You can also use unwrap() to remove body, head, and html:

soup.html.body.unwrap()
if soup.html.head is not None:
    soup.html.head.unwrap()
soup.html.unwrap()
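
The unwrap() approach above can be wrapped in a small helper. This is only a sketch: the function name strip_wrappers is made up for illustration, and the sample markup spells out its own <html>, <head> and <body> tags so that it behaves the same with the built-in html.parser.

```python
from bs4 import BeautifulSoup

def strip_wrappers(soup):
    """Unwrap <body>, <head> and <html> in place when they are present.

    Hypothetical helper for illustration, not part of bs4.
    """
    for name in ("body", "head", "html"):
        tag = soup.find(name)
        if tag is not None:
            tag.unwrap()  # removes the tag itself but keeps its children
    return soup

soup = BeautifulSoup(
    "<html><head></head><body><p>content</p></body></html>", "html.parser"
)
print(strip_wrappers(soup))
```

Keep in mind that unwrapping <head> leaves whatever children it had (a <title>, for instance) in the output.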

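If you load an XML file, bs4.diagnose.diagnose(data) will tell you to use lxml-xml, which will not wrap your soup with html+body:

>>> from bs4 import BeautifulSoup as BS
>>> BS('<foo>xxx</foo>', 'lxml-xml')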
<foo>xxx</foo>

This aspect of BeautifulSoup has always annoyed me.

Here's how I deal with it:

# Parse the initial html-formatted string
soup = BeautifulSoup(html, 'lxml')
# Do stuff here
# Extract a string repr of the parse html object, without the <html> or <body> tags
html = "".join([str(x) for x in soup.body.children])

A quick breakdown:

# Iterator object of all tags within the <body> tag (your html before parsing)
soup.body.children
# Turn each element into a string object, rather than a BS4.Tag object
# Note: inclusive of html tags
str(x)
# Get a List of all html nodes as string objects
[str(x) for x in soup.body.children]
# Join all the string objects together to recreate your original html
"".join()

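I still don't love this, but it gets the job done. I run into this every time I use BS4 to filter certain elements and/or attributes out of an HTML document before doing something else with it, because I need the whole result back as a string rather than as a BS4 parsed object.

Hopefully the next time I Google this, I'll find my answer here.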
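
The filter-then-serialize workflow described above might look like the following sketch. The function name clean_fragment, the choice to drop <script> tags and style attributes, and the fallback for parsers that add no <body> are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

def clean_fragment(html, parser="html.parser"):
    """Filter a fragment and return it as a string without <html>/<body>.

    Hypothetical helper for illustration, not part of bs4.
    """
    soup = BeautifulSoup(html, parser)
    # Example filtering step: remove every <script> element
    for tag in soup.find_all("script"):
        tag.decompose()
    # Example filtering step: drop inline style attributes
    for tag in soup.find_all(True):
        tag.attrs.pop("style", None)
    # lxml and html5lib add a <body>; html.parser does not
    root = soup.body if soup.body is not None else soup
    return "".join(str(child) for child in root.children)

print(clean_fragment('<p style="color:red">Hi</p><script>x()</script>'))
```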

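You may be misunderstanding BeautifulSoup here. BeautifulSoup deals with whole HTML documents, not with HTML fragments. What you see is by design.

Without <html> and <body> tags, your HTML document is broken. BeautifulSoup leaves it to the specific parser to repair such a document, and different parsers differ in how much they can repair. html5lib is the most thorough of the parsers, but you'll get similar results with the lxml parser (although lxml leaves out the <head> tag). The html.parser parser is the least capable; it can do some repair work, but it doesn't add back tags that are required but missing.

So this is a deliberate feature of the html5lib library: it repairs HTML that is lacking, for example by adding back missing required elements.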
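
The differences between the parsers can be checked with a short script. This is a sketch, not something from the original answer: lxml and html5lib are optional installs, so the hypothetical helper parse_with skips any parser that bs4 cannot find.

```python
from bs4 import BeautifulSoup, FeatureNotFound

FRAGMENT = "<h1>FOO</h1>"

def parse_with(parser):
    """Return the serialized tree, or None when the parser is not installed."""
    try:
        return str(BeautifulSoup(FRAGMENT, parser))
    except FeatureNotFound:
        return None

results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, output in results.items():
    print(f"{name:12} {output if output is not None else '(not installed)'}")
```

With all three parsers installed, only html.parser should print the fragment unchanged.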

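BeautifulSoup has no option to treat the HTML you pass in as a fragment. At most you can "break" the document and remove the <html> and <body> elements again with the standard BeautifulSoup tree manipulation methods.

For example, Element.replace_with() lets you replace the html element with the <h1> element: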

>>> soup = BeautifulSoup('<h1>FOO</h1>', 'html5lib')
>>> soup
<html><head></head><body><h1>FOO</h1></body></html>
>>> soup.html.replace_with(soup.body.contents[0])
<html><head></head><body></body></html>
>>> soup
<h1>FOO</h1>

Take into account, however, that html5lib can add other elements to your tree too, such as tbody elements:

>>> BeautifulSoup(
...     '<table><tr><td>Foo</td><td>Bar</td></tr></table>', 'html5lib'
... ).table
<table><tbody><tr><td>Foo</td><td>Bar</td></tr></tbody></table>

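The HTML standard states that a table should always have a <tbody> element, and that if it is missing, a parser should treat the document as if the element were there anyway. html5lib follows the standard very, very closely.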
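
For contrast, here is a small sketch of the same table parsed with the built-in html.parser, which does not repair the markup and so adds no tbody. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

markup = "<table><tr><td>Foo</td><td>Bar</td></tr></table>"
soup = BeautifulSoup(markup, "html.parser")

print(soup)                # the table comes back as written
print(soup.find("tbody"))  # no <tbody> element was inserted
```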

Yet another solution:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <a href="http://google.com">Google</a></p><p>Hi!</p>', 'lxml')
# content handling example (just for illustration):
# replace Google with StackOverflow
for a in soup.find_all('a'):
    a['href'] = 'http://stackoverflow.com/'
    a.string = 'StackOverflow'
# serialize only the direct children of <body>, leaving out the <html>/<body> wrapper
print(''.join(str(i) for i in soup.html.body.find_all(recursive=False)))

html = str(soup)
html = html.replace("<html><body>", "")
html = html.replace("</body></html>", "")

This strips the html/body wrapper tags. A more sophisticated version would also check startsWith, endsWith, and so on.
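
One way the "more sophisticated version" could look is sketched below. The function name strip_html_body and its default prefix and suffix are assumptions for illustration; the wrapper is removed only when it really sits at the very start and end of the string.

```python
def strip_html_body(html, prefix="<html><body>", suffix="</body></html>"):
    """Remove the wrapper only if the string starts and ends with it.

    Hypothetical helper for illustration.
    """
    if html.startswith(prefix) and html.endswith(suffix):
        return html[len(prefix):len(html) - len(suffix)]
    return html  # leave anything else untouched

print(strip_html_body("<html><body><p>Hi</p></body></html>"))
print(strip_html_body("<p>Hi</p>"))
```

This still depends on the exact wrapper text the parser emits, so a parser that also writes a <head> (html5lib, for example) would need a different prefix.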

If you want it to look better, try this:

BeautifulSoup([contents you want to analyze]).prettify()

Here is how I do it:

a = BeautifulSoup('', 'html.parser')
a.append(a.new_tag('section'))
# this will give you <section></section>
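
Building on the same idea, a fragment can be assembled from an empty soup so that no wrapper tags ever appear. This is a minimal sketch; the tag name and text are arbitrary examples.

```python
from bs4 import BeautifulSoup

# Start from an empty document; html.parser adds no wrapper tags
soup = BeautifulSoup("", "html.parser")

section = soup.new_tag("section")
section.string = "Hello"
soup.append(section)

print(soup)
```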

Since v4.0.1 there has been a decode_contents() method:

>>> BeautifulSoup('<h1>FOO</h1>', 'html5lib').body.decode_contents()
'<h1>FOO</h1>'

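More details in an answer to this question: https://stackoverflow.com/a/18602241/237105

Update:

As @MartijnPieters rightly noted in the comments, this way you will still get some extra tags such as tbody (in tables), which you may or may not want.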
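
A small helper built on decode_contents() might look like this. It is a sketch under stated assumptions: the name inner_html is invented here, and the fallback to the soup object covers parsers such as html.parser that add no <body>.

```python
from bs4 import BeautifulSoup

def inner_html(markup, parser="html.parser"):
    """Parse markup and return it as a string without an <html>/<body> wrapper.

    Hypothetical helper for illustration, not part of bs4.
    """
    soup = BeautifulSoup(markup, parser)
    # html5lib and lxml add a <body>; html.parser does not, so fall back to the soup
    root = soup.body if soup.body is not None else soup
    return root.decode_contents()

print(inner_html("<h1>FOO</h1>"))
```

Passing parser="html5lib" should give the same string for this input, though elements such as tbody would still be added inside tables.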
