I have thousands of XML files like the following:
<names>
<Id>1518845</Id>
<Name>Confessions of a Thug (Paperback)</Name>
<Authors>Philip Meadows Taylor</Authors>
<Publisher>Rupa & Co</Publisher>
<CountsOfReview>2.0</CountsOfReview>
</names>
I have tried the following code to parse them:
from lxml import etree
root = etree.parse("xm_file.xml")
import xml.etree.ElementTree as ET
tree = ET.parse("xm_file.xml")
and
parser = ET.XMLParser(encoding="utf-8")
tree = ET.parse("xm_file.xml", parser=parser)
All of these fail with one of the following errors:
ParseError: not well-formed (invalid token): line 10, column 18
XMLSyntaxError: xmlParseEntityRef: no name, line 10, column 19
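I have searched for and tried many solutions to make this work for all files, but in vain.
Note: this did not help me: How to parse invalid (bad / not well-formed) XML?
Another case is: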
<names>
<Id>1481744</Id>
<Name>Lettres de René-Édouard Claparède <1832-1871>.: Choisies et annotées</Name>
<Authors>René-Édouard Claparède</Authors>
<ISBN>3796505635</ISBN>
<Rating>2.0</Rating>
<PublishYear>1971</PublishYear>
<PublishMonth>31</PublishMonth>
<PublishDay>12</PublishDay>
</names>
When this is parsed, the XML is processed as if it were:
<names>
<Id>1481744</Id>
<Name>Lettres de René-Édouard Claparède</Name>
</names>
and the rest of the information does not appear.
You can replace & with &amp; beforehand:
import xml.etree.ElementTree as ET
data = """
<names>
<Id>1518845</Id>
<Name>Confessions of a Thug (Paperback)</Name>
<Authors>Philip Meadows Taylor</Authors>
<Publisher>Rupa & Co</Publisher>
<CountsOfReview>2.0</CountsOfReview>
</names>
"""
data = data.replace('&', '&amp;')  # escape bare ampersands so the parser accepts them
tree = ET.ElementTree(ET.fromstring(data))
for publisher in tree.findall("Publisher"):
    print(publisher.text)
which produces:
Rupa & Co
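One caveat: a blanket replacement of & also rewrites ampersands that already belong to an entity, so an existing &amp; would become &amp;amp;. A minimal sketch that only escapes bare ampersands is shown here. The pattern and the helper name escape_bare_ampersands are my own, not part of any library, and the pattern only checks that something looks like an entity reference: text such as "AT&T;" would be left alone and would still fail to parse.

```python
import re
import xml.etree.ElementTree as ET

# Matches an '&' that is NOT the start of something shaped like an entity
# reference (&name;, &#123; or &#x1F;). Hypothetical pattern for illustration.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Escape only ampersands that are not already part of an entity."""
    return BARE_AMP.sub("&amp;", text)


sample = (
    "<names>"
    "<Publisher>Rupa & Co</Publisher>"
    "<Name>Tom &amp; Jerry</Name>"
    "</names>"
)
root = ET.fromstring(escape_bare_ampersands(sample))
print(root.findtext("Publisher"))  # Rupa & Co
print(root.findtext("Name"))       # Tom & Jerry
```

If the files never contain already-escaped entities, the plain str.replace is enough and simpler.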
One possible approach is to load the problematic file first, replace & with &amp;, and feed the result to xml.etree.ElementTree, like so:
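import xml.etree.ElementTree as ET

# Read the raw text first so it can be repaired before parsing.
with open("some_cool_file", encoding="utf-8") as fp:
    content = fp.read()
content = content.replace('&', '&amp;')
xml = ET.ElementTree(ET.fromstring(content))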
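Replacing ampersands does not cover the second case from the question, where a stray <1832-1871> sits inside Name. A rough sketch of one way to handle it, assuming the set of element names is known in advance: anything that looks like a tag but is not in the list gets escaped as text. KNOWN_TAGS is a hypothetical whitelist built from the two samples, and sanitize and parse_file are illustrative names, not library functions. It also assumes the files are UTF-8 and that comments and CDATA sections contain no angle brackets.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: every legitimate element name is listed here.
KNOWN_TAGS = {
    "names", "Id", "Name", "Authors", "Publisher", "CountsOfReview",
    "ISBN", "Rating", "PublishYear", "PublishMonth", "PublishDay",
}

BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")
# Anything of the form <...> with no nested angle brackets.
TAG_LIKE = re.compile(r"<(/?)([^<>\s/]+)([^<>]*)>")


def _escape_unknown(match):
    name = match.group(2)
    # Keep known elements, and leave declarations/comments (<?...>, <!...>) alone.
    if name in KNOWN_TAGS or name[0] in "?!":
        return match.group(0)
    # Otherwise treat the whole thing as literal text.
    return "&lt;" + match.group(0)[1:-1] + "&gt;"


def sanitize(text):
    """Escape bare ampersands, then tag-like runs that are not known elements."""
    return TAG_LIKE.sub(_escape_unknown, BARE_AMP.sub("&amp;", text))


def parse_file(path):
    """Read a file as UTF-8 text, repair it, and return the root element."""
    with open(path, encoding="utf-8") as fp:
        return ET.fromstring(sanitize(fp.read()))


data = """
<names>
<Id>1481744</Id>
<Name>Lettres de René-Édouard Claparède <1832-1871>.: Choisies et annotées</Name>
<Authors>René-Édouard Claparède</Authors>
<PublishYear>1971</PublishYear>
</names>
"""
root = ET.fromstring(sanitize(data))
print(root.findtext("Name"))
print(root.findtext("PublishYear"))
```

With this, Name keeps the bracketed date as text and the elements after it are still present. For thousands of files, parse_file can be called in a loop, catching ET.ParseError to log whichever files still fail.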