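I am using the lxml library with Python to parse a simple XML file. In this example I want to print the text of the elements that follow each HD element in the XML shown below.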
<BOOK>
<HD>The Best Book Ever</HD>
<HD>Table of Contents</HD>
<EXTRACT>
<TC>I. Introduction</TC>
<TC>II. Summary</TC>
<TC>III. Topic 1</TC>
<TC>IV. Topic 2</TC>
</EXTRACT>
<HD>I. Introduction</HD>
<p>
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
<FTN>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget</FTN>
</p>
<p>has been the industry standard dummy text ever since the 1500s</p>
<HD>II. Summary</HD>
<p>
<FT>data 1</FT>
data 2
<FT>data 3</FT>
</p>
<p>
<FT>data 4</FT>
data 5
<FT>data 6</FT>
</p>
<p>has been the industry standard dummy text ever since the 1500s</p>
<HD>III. Topic 1</HD>
<p>
something
<p>something else</p>
</p>
<HD>IV. Topic 2</HD>
<p>
something1
<p>something else 1</p>
</p>
<p>
something 2
<p>something else 2</p>
</p>
<HD>V. Topic 3</HD>
<p>
something not to show up
<p>because not in EXTRACT as TC</p>
</p>
</BOOK>
My Python code is shown below. It is supposed to print all of the content that follows each HD tag.
import os
from lxml import etree

file_name = 'demofile2.xml'
full_file_name = os.path.abspath(os.path.join('', file_name))

def load_local_file(filename):
    dom = etree.parse(filename)
    #get all content of elements after HD tag
    TOCsHD = dom.getroot().findall('HD')
    for hd in TOCsHD:
        text = hd.text
        print(text)
        for x in hd.getnext().iter():
            print(x.text)
            print(x.tail)
        print("------------------------------")

load_local_file(full_file_name)
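My output is shown below. As you can see, under II. Summary, for example, it does not print data 4, data 5, data 6. Can someone help me with this? Thanks a lot!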
The Best Book Ever
Table of Contents
------------------------------
Table of Contents
I. Introduction
II. Summary
III. Topic 1
IV. Topic 2
------------------------------
I. Introduction
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget
------------------------------
II. Summary
data 1
data 2
data 3
------------------------------
III. Topic 1
something
something else
------------------------------
IV. Topic 2
something1
something else 1
------------------------------
V. Topic 3
something not to show up
because not in EXTRACT as TC
------------------------------
I'm not certain, but my guess is that what you need is itersiblings:
import os
from lxml import etree

file_name = 'demofile2.xml'
full_file_name = os.path.abspath(os.path.join('', file_name))

def load_local_file(filename):
    dom = etree.parse(filename)
    # get all content of elements after HD tag
    TOCsHD = dom.getroot().findall('HD')
    for hd in TOCsHD:
        print("Siblings of: " + hd.text)
        # itersiblings() yields every following sibling, not just the next one
        theIter = hd.itersiblings()
        for x in theIter:
            # join the text of all nested tags, then drop newlines
            print(x.tag, "".join(x.itertext()).strip().replace("\n", ""), sep=": ")
        print("------------------------------")

load_local_file(full_file_name)
I'm not sure this gives exactly the result you want, but if you are interested in the siblings of a tag, this function will work.
Output:
Siblings of: The Best Book Ever
HD: Table of Contents
EXTRACT: I. Introduction II. Summary III. Topic 1 IV. Topic 2
HD: I. Introduction
p: Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget
p: has been the industry standard dummy text ever since the 1500s
HD: II. Summary
p: data 1 data 2 data 3
p: data 4 data 5 data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something something else
HD: IV. Topic 2
p: something1 something else 1
p: something 2 something else 2
HD: V. Topic 3
p: something not to show up because not in EXTRACT as TC
------------------------------
Siblings of: Table of Contents
EXTRACT: I. Introduction II. Summary III. Topic 1 IV. Topic 2
HD: I. Introduction
p: Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget
p: has been the industry standard dummy text ever since the 1500s
HD: II. Summary
p: data 1 data 2 data 3
p: data 4 data 5 data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something something else
HD: IV. Topic 2
p: something1 something else 1
p: something 2 something else 2
HD: V. Topic 3
p: something not to show up because not in EXTRACT as TC
------------------------------
Siblings of: I. Introduction
p: Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget
p: has been the industry standard dummy text ever since the 1500s
HD: II. Summary
p: data 1 data 2 data 3
p: data 4 data 5 data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something something else
HD: IV. Topic 2
p: something1 something else 1
p: something 2 something else 2
HD: V. Topic 3
p: something not to show up because not in EXTRACT as TC
------------------------------
Siblings of: II. Summary
p: data 1 data 2 data 3
p: data 4 data 5 data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something something else
HD: IV. Topic 2
p: something1 something else 1
p: something 2 something else 2
HD: V. Topic 3
p: something not to show up because not in EXTRACT as TC
------------------------------
Siblings of: III. Topic 1
p: something something else
HD: IV. Topic 2
p: something1 something else 1
p: something 2 something else 2
HD: V. Topic 3
p: something not to show up because not in EXTRACT as TC
------------------------------
Siblings of: IV. Topic 2
p: something1 something else 1
p: something 2 something else 2
HD: V. Topic 3
p: something not to show up because not in EXTRACT as TC
------------------------------
Siblings of: V. Topic 3
p: something not to show up because not in EXTRACT as TC
------------------------------
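In the question's code, hd.getnext() returns only the first following sibling, which is why the second p under II. Summary never appeared. If the goal is just the content that belongs to each heading, one possible variation is to stop iterating at the next HD sibling. The following is a minimal sketch under that assumption; the sections helper and the shortened SAMPLE string are hypothetical names introduced for illustration, not part of the original code.

```python
from lxml import etree

# Hypothetical, shortened sample in the same shape as the question's XML.
SAMPLE = """<BOOK>
<HD>II. Summary</HD>
<p><FT>data 1</FT> data 2 <FT>data 3</FT></p>
<p><FT>data 4</FT> data 5 <FT>data 6</FT></p>
<HD>III. Topic 1</HD>
<p>something <p>something else</p></p>
</BOOK>"""


def sections(root):
    """Map each HD heading to the text of the siblings before the next HD."""
    result = {}
    for hd in root.findall('HD'):
        body = []
        for sib in hd.itersiblings():
            if sib.tag == 'HD':
                break  # the next heading starts a new section
            # itertext() collects nested text; split/join normalizes whitespace
            body.append(" ".join("".join(sib.itertext()).split()))
        result[hd.text] = body
    return result


root = etree.fromstring(SAMPLE)
for heading, body in sections(root).items():
    print(heading)
    for line in body:
        print("  " + line)
    print("------------------------------")
```

With this grouping, each heading is printed once with only its own paragraphs, instead of every remaining sibling in the document.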
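Note that you also need to use itertext to get all of the text inside all of the tags. For example, some p tags contain nested tags. To get the text value of those p tags, you need to apply itertext so that the inner text is included. You can see how this works by looking at the line containing "".join(x.itertext()).strip().replace("\n", "").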
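To make the difference concrete, here is a small sketch comparing .text, .tail and itertext() on a single p element. The inline XML string is a hypothetical fragment modeled on the question's data, and the variable names are chosen only for illustration.

```python
from lxml import etree

# Hypothetical fragment: a p element whose content is split across
# nested FT tags and the text between them.
p = etree.fromstring("<p><FT>data 1</FT>\ndata 2\n<FT>data 3</FT></p>")

# .text only holds the text before the first child (none here).
direct_text = p.text

# The text between child tags lives in each child's .tail.
tails = [child.tail for child in p]

# itertext() walks the element and yields every text and tail in order.
full_text = "".join(p.itertext())

# Collapsing whitespace gives a single readable line.
cleaned = " ".join(full_text.split())

print(repr(direct_text))
print(tails)
print(repr(full_text))
print(cleaned)
```

Reading only .text would miss everything in this element, which is why the answer joins the pieces returned by itertext().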