Python ElementTree: extracted XML text is truncated when the text contains HTML markup



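I am using Python's xml.etree.ElementTree to scrape PubMed XML documents. HTML formatting elements embedded in the text cause only a fragment of the text to be returned for a given XML element. For the XML element below, only the text up to the italic tag is returned.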

<AbstractText>Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which <i>Microdochium</i> species are the most harmful.</AbstractText>

The sample code below runs, but it does not return the complete record when the record contains HTML.

import xml.etree.ElementTree as ET

xmldata = 'directory/to/data.xml'
tree = ET.parse(xmldata)
root = tree.getroot()
titles = {}  # maps article index -> title text
for i in range(len(root)):
    for child in root[i].iter():
        if child.tag == 'ArticleTitle':
            # child.text stops at the first nested element such as <i>
            titles[i] = child.text

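I also tried something similar with lxml.etree, using child.xpath('//AbstractText/text()'). This returns every text node in the document as a separate list element, but there is no obvious way to recombine those elements into the original abstracts (i.e., three abstracts may return more than three list elements).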

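The answer is itertext(), which collects all of an element's inner text, including the text inside and after any nested elements.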

The code should look like this:

import xml.etree.ElementTree as ET
from io import StringIO
raw_data="""
<AbstractText>Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which <i>Microdochium</i> species are the most harmful.</AbstractText>
"""
tree = ET.parse(StringIO(raw_data))
root = tree.getroot()
# The element has a child element (<i>), so .text stops at the opening <i> tag
for e in root.findall("."):
    print(e.text, type(e))

Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which <class 'xml.etree.ElementTree.Element'>
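
The truncation follows from how ElementTree stores mixed content: text before the first child sits in the parent's .text, text inside the child sits in the child's .text, and text after the child's closing tag sits in the child's .tail. The sketch below is only an illustration of that layout on the same abstract; the variable names leading, inner and trailing are made up for this example.

```python
import xml.etree.ElementTree as ET

raw_data = (
    "<AbstractText>Snow mold is a severe plant disease caused by "
    "psychrophilic or psychrotolerant fungi, of which <i>Microdochium</i> "
    "species are the most harmful.</AbstractText>"
)
root = ET.fromstring(raw_data)
italic = root.find("i")

# Text before the first child element is stored on the parent.
leading = root.text
# Text inside <i>...</i> is stored on the child.
inner = italic.text
# Text after </i> is stored as the child's tail, not on the parent.
trailing = italic.tail

print(repr(leading))
print(repr(inner))
print(repr(trailing))
```

Concatenating the three pieces in that order reproduces the full abstract, which is what itertext() does for arbitrarily nested markup.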

Using itertext():

"".join(root.find(".").itertext()) # "".join(element.itertext())

'Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which Microdochium species are the most harmful.'
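
To keep each abstract together, as the question asks, one option is to call itertext() per element rather than collecting text nodes across the whole document. The following is a rough sketch under stated assumptions: the sample XML is a hypothetical, heavily shortened stand-in for a PubMed export, the helper full_text and the records list are names invented here, and joining multiple AbstractText elements with a space is just one possible choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample shaped like a PubMed export (assumption).
sample = """
<PubmedArticleSet>
  <PubmedArticle>
    <Article>
      <ArticleTitle>Snow mold caused by <i>Microdochium</i></ArticleTitle>
      <Abstract>
        <AbstractText>Snow mold is a severe plant disease, of which <i>Microdochium</i> species are the most harmful.</AbstractText>
      </Abstract>
    </Article>
  </PubmedArticle>
  <PubmedArticle>
    <Article>
      <ArticleTitle>A plain title</ArticleTitle>
      <Abstract>
        <AbstractText>First part.</AbstractText>
        <AbstractText>Second part with <sub>2</sub> markup.</AbstractText>
      </Abstract>
    </Article>
  </PubmedArticle>
</PubmedArticleSet>
"""


def full_text(element):
    """Return all text inside an element, including nested markup."""
    return "".join(element.itertext())


root = ET.fromstring(sample)
records = []
for article in root.iter("PubmedArticle"):
    title_el = article.find(".//ArticleTitle")
    title = full_text(title_el) if title_el is not None else None
    # One string per article, even if the abstract has several parts.
    abstract = " ".join(full_text(a) for a in article.iter("AbstractText"))
    records.append({"title": title, "abstract": abstract})

for record in records:
    print(record)
```

Because the loop works article by article, three articles always yield three records, regardless of how many inline tags each abstract contains.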
