lxml crashes when parsing a character from a special font



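I am trying to parse the following XML file with lxml.etree.parse(). The XML contains a character from a special font on line 3. The character comes from the Dingbats font, value 0x07, a telephone pictogram. In Notepad++ it is displayed as BEL, white letters inside a black rectangle. I could not paste the character into this question, so a placeholder marks where it occurs.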

<!DOCTYPE qgis PUBLIC 'http://mrcc.com/qgis.dtd' 'SYSTEM'>
      <layer pass="0" class="FontMarker" locked="0">
      <prop k="chr" v="!!!SPECIAL_CARACTER_HERE!!!"/>
      </layer>
</qgis>

This character makes lxml crash (xml crashes as well) with the following error:

  File "lxml.etree.pyx", line 3193, in lxml.etree.parse (src/lxml/lxml.etree.c:64168)
  File "parser.pxi", line 1548, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:91390)
  File "parser.pxi", line 1577, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:91674)
  File "parser.pxi", line 1477, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:90741)
  File "parser.pxi", line 1024, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:87655)
  File "parser.pxi", line 565, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:83243)
  File "parser.pxi", line 656, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:84225)
  File "parser.pxi", line 596, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:83549)
lxml.etree.XMLSyntaxError: invalid character in attribute value, line 3, column 14

How can I parse such a document?

Update: link to the file itself.

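It seems that lxml cannot cope with this character: 0x07 is a control character that XML 1.0 does not allow, so the document is not well-formed. You can, however, use the recover option to get past the error.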

recover - try hard to parse through broken XML

>>> from lxml import etree
>>> parser = etree.XMLParser(recover=True)
>>> tree = etree.parse("/tmp/qgis.xml", parser=parser)
>>> tree.find("layer/prop").attrib
{'v': '', 'k': 'chr'}
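
As an alternative to recover=True, one could strip the forbidden control characters before handing the data to the parser. The following is only a minimal sketch, assuming the file uses UTF-8 or another ASCII-compatible encoding (byte-level stripping would corrupt UTF-16). It uses the standard-library xml.etree.ElementTree so that it runs without extra packages; passing the cleaned bytes to lxml.etree.fromstring should work the same way. The function name parse_without_control_chars and the sample document are hypothetical, modelled on the question.

```python
import re
import xml.etree.ElementTree as ET

# Bytes that XML 1.0 forbids: the C0 control characters except
# tab (0x09), line feed (0x0A) and carriage return (0x0D).
_ILLEGAL_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def parse_without_control_chars(raw: bytes) -> ET.Element:
    """Remove forbidden control bytes, then parse the cleaned document.

    Assumption: the input is UTF-8 or another ASCII-compatible encoding.
    """
    cleaned = _ILLEGAL_XML_BYTES.sub(b"", raw)
    return ET.fromstring(cleaned)


# Hypothetical sample modelled on the question; \x07 is the BEL character.
sample = (
    b"<qgis>"
    b'<layer pass="0" class="FontMarker" locked="0">'
    b'<prop k="chr" v="\x07"/>'
    b"</layer>"
    b"</qgis>"
)

root = parse_without_control_chars(sample)
prop = root.find("layer/prop")
print(prop.attrib)
```

Like recover=True, this drops the character, so the v attribute ends up empty. If the value matters, the regular expression could substitute a visible placeholder instead of an empty byte string.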
