Error in lxml's fromstring function (Python)



I am trying to run:

import lxml.etree
xml_str = """
<root>
<H4>
</H4>
<P>
Hong Kong, February 06, 2020 -- </P>
<P>
&bull; Testing data only
</P>
</root>
"""
utf8_parser = lxml.etree.XMLParser(encoding='utf-8')
metadata_xml = lxml.etree.fromstring("""<root>""" + xml_str + """</root>""",
parser=utf8_parser)

I get an error:

File "src\lxml\etree.pyx", line 3236, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1757, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1068, in lxml.etree._BaseParser._parseUnicodeDoc
File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 640, in lxml.etree._raiseParseError
File "<string>", line 9
lxml.etree.XMLSyntaxError: Entity 'bull' not defined, line 9, column 7

Does anyone know what I should do?

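As jordanm commented, use the HTML parser instead of the XML parser. &bull; is an HTML named entity; XML predefines only &amp;, &lt;, &gt;, &quot; and &apos;, which is why the XML parser reports 'bull' as not defined.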
import lxml.etree
xml_str = r"""
<root>
<H4>
</H4>
<P>
Hong Kong, February 06, 2020 -- </P>
<P>
&bull; Testing data only
</P>
</root>
"""
html_parser = lxml.etree.HTMLParser()
metadata_xml = lxml.etree.fromstring("""<root>""" + xml_str + """</root>""", 
parser=html_parser)

If you insist on using the XML parser, you can unescape the &bull; character entity reference before parsing, like this:

import html
xml_str = html.unescape(xml_str)
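
One thing to watch with html.unescape: it also decodes &amp; and &lt;, so if the text contains those, the result may stop being well-formed XML. Below is a minimal sketch of a more selective approach, assuming you only want to touch HTML named entities: it rewrites them as numeric character references (which any XML parser accepts) and leaves XML's five predefined entities alone. The helper name html_entities_to_numeric and the sample string are made up for illustration, and the check uses the standard library's xml.etree.ElementTree so that it runs without lxml; the converted string could be passed to lxml.etree.fromstring in the same way.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities every XML parser already understands; leave these untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named entity references such as &bull; or &nbsp;
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def html_entities_to_numeric(text):
    """Hypothetical helper: turn HTML named entities into numeric references."""
    def _replace(match):
        name = match.group(1)
        # Keep XML's own entities and anything that is not a known HTML entity.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#{};".format(name2codepoint[name])

    return _ENTITY_RE.sub(_replace, text)


# Illustrative input, not the exact string from the question.
sample = "<root><P>&bull; Testing data only &amp; more</P></root>"
converted = html_entities_to_numeric(sample)

# The converted string is now well-formed XML.
root = ET.fromstring(converted)
print(converted)
print(root.find("P").text)
```

Here &bull; becomes &#8226; while &amp; stays escaped, so the ampersand in the text does not break the parse.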
