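How to get lxml.Element.xpath to not expand entities

If I use lxml.etree.XMLParser(resolve_entities=False) to parse XML content, it correctly returns text nodes without expanding entities. (I would prefer that it simply keep the entity's text in place; instead, it truncates at the first entity.)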

from io import BytesIO
from lxml import etree
xml_content = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ELEMENT foo ANY>
<!ENTITY bar "benign">
]>
<body>
<expansion>    This is    a &bar; entity expansion with weird spacing.   </expansion>
</body>"""
nonexpanding_parser = etree.XMLParser(resolve_entities=False)
unexpanded_tree = etree.parse(BytesIO(xml_content), nonexpanding_parser)
elements = unexpanded_tree.xpath('//expansion')
elements[0].text  # '    This is    a '

However, when I call the XPath function normalize-space, it expands the entity, which is exactly what I am trying to avoid:

elements[0].xpath('normalize-space(.)')  # 'This is a benign entity expansion with weird spacing.'

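I suppose I could write my own normalization method, but I would rather avoid that, and I am not 100% sure what the exact specification of that function is. I am replacing it in existing code, so I want the behavior to stay the same.

The real question: can I get something like elements[0].xpath('normalize-space(.)') that returns This is a?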

Better yet:

This is a entity expansion with weird spacing. (this is the preferred example)
This is a &bar; entity expansion with weird spacing.

It's not exactly elegant, but you can extract all the text nodes, join them in Python, and normalize the whitespace.

(Pdb) text_fragments = unexpanded_tree.xpath('//expansion/text()')
(Pdb) text_fragments
['    This is    a ', ' entity expansion with weird spacing.   ']
(Pdb) ' '.join(''.join(text_fragments).split())
'This is a entity expansion with weird spacing.'
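If you want this as a reusable function, and optionally the variant that keeps the literal &bar; reference, something along these lines might work. It is only a sketch: the helper names `xpath_normalize_space` and `normalize_space_unexpanded` and the `keep_entity_refs` parameter are made up for illustration. It assumes that with resolve_entities=False lxml exposes each entity reference as a child node whose `tag` is `etree.Entity`, whose `.text` is the `&name;` form, and whose `.tail` holds the text that follows it. Whitespace is collapsed using the four characters XPath 1.0 treats as whitespace (space, tab, CR, LF), which is narrower than Python's `str.split()`.

```python
import re
from io import BytesIO
from lxml import etree

# XPath 1.0 whitespace: space, tab, carriage return, line feed.
_XPATH_WS = re.compile(r"[ \t\r\n]+")


def xpath_normalize_space(value):
    """Collapse whitespace runs to one space and trim both ends."""
    return _XPATH_WS.sub(" ", value).strip(" \t\r\n")


def _fragments(node, keep_entity_refs):
    """Yield text pieces of node in document order (hypothetical helper)."""
    if node.tag is etree.Entity:
        # Unresolved entity reference: emit '&name;' or nothing at all.
        if keep_entity_refs:
            yield node.text
        return
    if node.tag is etree.Comment or node.tag is etree.PI:
        # Comments and processing instructions are not part of the string-value.
        return
    if node.text:
        yield node.text
    for child in node:
        yield from _fragments(child, keep_entity_refs)
        if child.tail:
            yield child.tail


def normalize_space_unexpanded(element, keep_entity_refs=False):
    """normalize-space(.)-like result without expanding entity references."""
    return xpath_normalize_space("".join(_fragments(element, keep_entity_refs)))


xml_content = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ELEMENT foo ANY>
<!ENTITY bar "benign">
]>
<body>
<expansion>    This is    a &bar; entity expansion with weird spacing.   </expansion>
</body>"""

parser = etree.XMLParser(resolve_entities=False)
tree = etree.parse(BytesIO(xml_content), parser)
expansion = tree.xpath("//expansion")[0]

dropped = normalize_space_unexpanded(expansion)
kept = normalize_space_unexpanded(expansion, keep_entity_refs=True)
print(dropped)
print(kept)
```

The same layout also explains why elements[0].text stops at 'This is    a ': the rest of the sentence is stored in the tail of the entity node rather than in the element's own text.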
