How to get lxml.Element.xpath to not expand entities



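If I parse XML content with lxml.etree.XMLParser(resolve_entities=False), it correctly returns text nodes without expanding entities. (I would prefer that it simply keep the entity's text; instead, the text is cut off at the first entity.)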

from io import BytesIO
from lxml import etree
xml_content = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ELEMENT foo ANY>
<!ENTITY bar "benign">
]>
<body>
<expansion>    This is    a &bar; entity expansion with weird spacing.   </expansion>
</body>"""
nonexpanding_parser = etree.XMLParser(resolve_entities=False)
unexpanded_tree = etree.parse(BytesIO(xml_content), nonexpanding_parser)
elements = unexpanded_tree.xpath('//expansion')
elements[0].text  # '    This is    a '

However, when I call the XPath function normalize-space, it expands the entities, which is exactly what I am trying to avoid:

elements[0].xpath('normalize-space(.)')  # 'This is a benign entity expansion with weird spacing.'

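I suppose I could write my own normalization method, but I would rather avoid that. I am also not 100% sure what the exact specification of that function is, and since I am replacing an existing use of it in my code, I want the behavior to stay the same.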

The real question is: can I get something like elements[0].xpath('normalize-space(.)') that returns This is a

Even better would be one of:

  • This is a entity expansion with weird spacing. (this is the preferred result)
  • This is a &bar; entity expansion with weird spacing.

It's not exactly elegant, but you can extract all the text nodes, join them in Python, and normalize the whitespace.

(Pdb) text_fragments = unexpanded_tree.xpath('//expansion/text()')
(Pdb) text_fragments
['    This is    a ', ' entity expansion with weird spacing.   ']
(Pdb) ' '.join(''.join(text_fragments).split())
'This is a entity expansion with weird spacing.'
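One caveat with the str.split() approach: it splits on any Unicode whitespace, whereas XPath 1.0's normalize-space only treats space, tab, carriage return, and line feed as whitespace. If matching that behavior matters, a small helper along these lines may be closer. This is only a sketch; the name normalize_space and its signature are made up here for illustration and are not part of lxml.

```python
import re

# XPath 1.0 whitespace is limited to space, tab, carriage return and line feed.
_XPATH_WS = re.compile(r'[ \t\r\n]+')

def normalize_space(fragments):
    """Join text fragments and normalize whitespace the way XPath 1.0 does.

    `fragments` is any iterable of strings, for example the result of
    tree.xpath('//expansion/text()'). Hypothetical helper, not an lxml API.
    """
    joined = ''.join(fragments)
    # Collapse each whitespace run to a single space, then trim both ends.
    return _XPATH_WS.sub(' ', joined).strip(' ')

fragments = ['    This is    a ', ' entity expansion with weird spacing.   ']
print(normalize_space(fragments))
```

With the tree from the question, it would be called as normalize_space(unexpanded_tree.xpath('//expansion/text()')). Unlike str.split(), it leaves characters such as a non-breaking space (U+00A0) untouched.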
