BeautifulSoup does not correctly parse script text/template

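I have a fairly complex template script that BeautifulSoup4 fails to understand for some reason. As you can see below, BS4 parses only part of the tree before giving up. Why does this happen, and is there a way to work around it?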

>>> from bs4 import BeautifulSoup
>>> html = """<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</h1></header><table><tr><th>Title</th><td class="class"></td><th>Title</th><td class="class"></td></tr><tr><th>Title</th><td class="class"></td><th>Another row</th><td class="checksum"></td></tr></table></section></script> Other stuff I want to stay"""
>>> soup = BeautifulSoup(html)
>>> soup.findAll('script')
[<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</script>]

Edit: On further testing, BS3 seems to parse this correctly for some reason:

>>> from BeautifulSoup import BeautifulSoup as bs3
>>> soup = bs3(html)
>>> soup.script
<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</h1></header><table><tr><th>Title</th><td class="class"></td><th>Title</th><td class="class"></td></tr><tr><th>Title</th><td class="class"></td><th>Another row</th><td class="checksum"></td></tr></table></section></script>

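Beautiful Soup's default parser sometimes fails. Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers.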

In some cases I have had to switch to a different parser, such as lxml or html5lib.

Here is an example of what I described above:

from bs4 import BeautifulSoup
# markup is the HTML string to parse; "lxml" requires the lxml package to be installed
soup = BeautifulSoup(markup, "lxml")

I recommend reading the section on installing a parser: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
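
To check which parser handles the template on a given machine, something like the following sketch may help. It passes the parser name explicitly and prints what each installed parser keeps inside the <script> tag. The helper name script_contents is made up for this illustration, the html string is a shortened version of the one in the question, and lxml and html5lib are optional third-party packages that are skipped when they are not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened version of the markup from the question (an assumption for brevity).
html = (
    '<script id="scriptname" type="text/template">'
    '<section class="sectionname"><header><h1>Test</h1></header></section>'
    '</script> Other stuff I want to stay'
)


def script_contents(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Return {parser name: inner markup of the first <script>} per installed parser.

    Hypothetical helper, not part of Beautiful Soup.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is not installed.
            continue
        script = soup.find("script")
        # decode_contents() renders whatever the parser kept inside the tag,
        # whether it stored the body as raw text or as nested tags.
        results[name] = script.decode_contents() if script is not None else None
    return results


if __name__ == "__main__":
    for name, inner in script_contents(html).items():
        print(name, "->", inner)
```

If one parser truncates the script body and another keeps it intact, passing the name of the working parser to BeautifulSoup is probably the simplest fix.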
