Parsing XML that contains illegal special characters (&)



I have thousands of XML files like the following:

<names>
<Id>1518845</Id>
<Name>Confessions of a Thug (Paperback)</Name>
<Authors>Philip Meadows Taylor</Authors>
<Publisher>Rupa & Co</Publisher>
<CountsOfReview>2.0</CountsOfReview>
</names>

I have tried the following code to parse them:

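# Attempt 1: lxml
from lxml import etree
root = etree.parse("xm_file.xml")

# Attempt 2: the standard library
import xml.etree.ElementTree as ET
tree = ET.parse("xm_file.xml")

# Attempt 3: the standard library with an explicit encoding
parser = ET.XMLParser(encoding="utf-8")
tree = ET.parse("xm_file.xml", parser=parser)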

All of these fail with one of the following errors:

ParseError: not well-formed (invalid token): line 10, column 18
XMLSyntaxError: xmlParseEntityRef: no name, line 10, column 19

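I have searched for and tried many solutions to make this work for all files, but in vain.

Note: this did not help me: How to parse invalid (bad / not well-formed) XML?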

Another case is:

<names>
<Id>1481744</Id>
<Name>Lettres de René-Édouard Claparède <1832-1871>.: Choisies et annotées</Name>
<Authors>René-Édouard Claparède</Authors>
<ISBN>3796505635</ISBN>
<Rating>2.0</Rating>
<PublishYear>1971</PublishYear>
<PublishMonth>31</PublishMonth>
<PublishDay>12</PublishDay>
</names>

When this is parsed, the XML is handled as if it were only:

<names>
<Id>1481744</Id>
<Name>Lettres de René-Édouard Claparède</Name>
</names>

and the rest of the information is missing.

You can replace each & with &amp; before parsing:

import xml.etree.ElementTree as ET
data = """
<names>
<Id>1518845</Id>
<Name>Confessions of a Thug (Paperback)</Name>
<Authors>Philip Meadows Taylor</Authors>
<Publisher>Rupa & Co</Publisher>
<CountsOfReview>2.0</CountsOfReview>
</names>
"""
# Escape the raw ampersand so the text becomes well-formed XML
data = data.replace('&', '&amp;')
tree = ET.ElementTree(ET.fromstring(data))
for publisher in tree.findall("Publisher"):
    print(publisher.text)

This prints:

Rupa & Co
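One caveat: a plain str.replace also rewrites ampersands that already belong to a reference such as &amp; or &#233;, turning them into &amp;amp;. Below is a minimal sketch, assuming the files may mix bare and already-escaped ampersands, that escapes only the bare ones with a regular expression. The helper name escape_bare_ampersands and the sample string are made up for illustration. Named entities that XML does not predefine (for example &nbsp;) are left untouched by this pattern and would still be rejected by the parser.

```python
import re
import xml.etree.ElementTree as ET

# Matches '&' that does NOT start an entity or character reference
# such as &amp; &lt; &#233; or &#xE9;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def escape_bare_ampersands(text):
    """Escape only the ampersands that are not part of an existing reference."""
    return BARE_AMP.sub("&amp;", text)

# Hypothetical sample mixing a bare '&' with an already-escaped '&amp;'
sample = "<names><Publisher>Rupa & Co</Publisher><Name>Tom &amp; Jerry</Name></names>"
root = ET.fromstring(escape_bare_ampersands(sample))
print(root.findtext("Publisher"))
print(root.findtext("Name"))
```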

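One possible approach is to read the problematic file first, replace each & with &amp;, and pass the result to xml.etree.ElementTree, like this:

import xml.etree.ElementTree as ET

with open("some_cool_file") as fp:
    content = fp.read()
# Escape raw ampersands before handing the text to the parser
content = content.replace('&', '&amp;')
xml = ET.ElementTree(ET.fromstring(content))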
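The same pre-processing idea might be extended to the second case in the question, where a literal <1832-1871> appears inside element text. The following is a rough sketch, not a general fix: it assumes that real tags in these files always begin with a letter or underscore, so a < followed by anything else (such as a digit) cannot be markup and can be escaped as &lt;. The helper name escape_stray_lt is hypothetical, and the sample is a shortened version of the record from the question. Text that looks like a tag, such as <abc>, would not be caught by this heuristic.

```python
import re
import xml.etree.ElementTree as ET

# '<' that cannot begin a tag, closing tag, comment, or processing instruction
STRAY_LT = re.compile(r"<(?![A-Za-z_/!?])")

def escape_stray_lt(text):
    """Escape '<' characters that cannot start markup (heuristic)."""
    return STRAY_LT.sub("&lt;", text)

sample = """<names>
<Id>1481744</Id>
<Name>Lettres de René-Édouard Claparède <1832-1871>.: Choisies et annotées</Name>
<Rating>2.0</Rating>
</names>"""

root = ET.fromstring(escape_stray_lt(sample))
print(root.findtext("Name"))    # the full title, including <1832-1871>
print(root.findtext("Rating"))  # elements after <Name> are no longer lost
```

A lone > is allowed in XML text, so only the < needs escaping here.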
