lxml: getting the text of an element without its tags



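I am using the lxml library with Python to parse a simple XML file. In this example, I want to print the text of the elements that follow each HD element, as in the XML below.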

<BOOK>
<HD>The Best Book Ever</HD>
<HD>Table of Contents</HD>
<EXTRACT>
<TC>I. Introduction</TC>
<TC>II. Summary</TC>
<TC>III. Topic 1</TC>
<TC>IV. Topic 2</TC>
</EXTRACT>
<HD>I. Introduction</HD>
<p>
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
<FTN>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget</FTN>
</p>
<p>has been the industry standard dummy text ever since the 1500s</p>
<HD>II. Summary</HD>
<p>
<FT>data 1</FT>
data 2
<FT>data 3</FT>
</p>
<p>
<FT>data 4</FT>
data 5
<FT>data 6</FT>
</p>
<p>has been the industry standard dummy text ever since the 1500s</p>
<HD>III. Topic 1</HD>
<p>
something
<p>something else</p>
</p>
<HD>IV. Topic 2</HD>
<p>
something1
<p>something else 1</p>
</p>
<p>
something 2
<p>something else 2</p>
</p>
<HD>V. Topic 3</HD>
<p>
something not to show up
<p>because not in EXTRACT as TC</p>
</p>
</BOOK>

My Python code is shown below. It should print everything that follows each HD tag.

import os
from lxml import etree

file_name = 'demofile2.xml'
full_file_name = os.path.abspath(os.path.join('', file_name))

def load_local_file(filename):
    dom = etree.parse(filename)
    # get all content of elements after HD tag
    TOCsHD = dom.getroot().findall('HD')
    for hd in TOCsHD:
        text = hd.text
        print(text)
        for x in hd.getnext().iter():
            print(x.text)
            print(x.tail)
        print("------------------------------")

load_local_file(full_file_name)

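My output is shown below. As you can see, under II. Summary, for example, data 4, data 5 and data 6 are not printed. Can someone help me with this? Thanks a lot!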

The Best Book Ever
Table of Contents

------------------------------
Table of Contents


I. Introduction

II. Summary

III. Topic 1

IV. Topic 2

------------------------------
I. Introduction
Lorem Ipsum is simply dummy text of the printing and typesetting industry.


Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget

------------------------------
II. Summary


data 1
data 2

data 3

------------------------------
III. Topic 1
something


something else

------------------------------
IV. Topic 2
something1


something else 1

------------------------------
V. Topic 3
something not to show up


because not in EXTRACT as TC

------------------------------

I'm not sure, but my guess is that what you need is itersiblings:

import os
from lxml import etree

file_name = 'demofile2.xml'
full_file_name = os.path.abspath(os.path.join('', file_name))

def load_local_file(filename):
    dom = etree.parse(filename)
    # get all content of elements after HD tag
    TOCsHD = dom.getroot().findall('HD')
    for hd in TOCsHD:
        print("Siblings of: " + hd.text)
        theIter = hd.itersiblings()
        for x in theIter:
            # itertext() collects the text of x and all of its descendants
            print(x.tag, "".join(x.itertext()).strip().replace("\n", ""), sep=": ")
        print("------------------------------")

load_local_file(full_file_name)

I am not sure this is the result you want, but if you are interested in the siblings of a tag, this function will work.

Output

Siblings of: The Best Book Ever
HD: Table of Contents
EXTRACT: I. Introduction      II. Summary      III. Topic 1      IV. Topic 2
HD: I. Introduction
p: Lorem Ipsum is simply dummy text of the printing and typesetting industry.      Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget
p: has been the industry standard dummy text ever since the 1500s
HD: II. Summary
p: data 1       data 2      data 3
p: data 4       data 5      data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something      something else
HD: IV. Topic 2
p: something1      something else 1
p: something 2      something else 2
HD: V. Topic 3
p: something not to show up      because not in EXTRACT as TC
------------------------------
Siblings of: Table of Contents
EXTRACT: I. Introduction      II. Summary      III. Topic 1      IV. Topic 2
HD: I. Introduction
p: Lorem Ipsum is simply dummy text of the printing and typesetting industry.      Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget
p: has been the industry standard dummy text ever since the 1500s
HD: II. Summary
p: data 1       data 2      data 3
p: data 4       data 5      data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something      something else
HD: IV. Topic 2
p: something1      something else 1
p: something 2      something else 2
HD: V. Topic 3
p: something not to show up      because not in EXTRACT as TC
------------------------------
Siblings of: I. Introduction
p: Lorem Ipsum is simply dummy text of the printing and typesetting industry.      Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget
p: has been the industry standard dummy text ever since the 1500s
HD: II. Summary
p: data 1       data 2      data 3
p: data 4       data 5      data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something      something else
HD: IV. Topic 2
p: something1      something else 1
p: something 2      something else 2
HD: V. Topic 3
p: something not to show up      because not in EXTRACT as TC
------------------------------
Siblings of: II. Summary
p: data 1       data 2      data 3
p: data 4       data 5      data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something      something else
HD: IV. Topic 2
p: something1      something else 1
p: something 2      something else 2
HD: V. Topic 3
p: something not to show up      because not in EXTRACT as TC
------------------------------
Siblings of: III. Topic 1
p: something      something else
HD: IV. Topic 2
p: something1      something else 1
p: something 2      something else 2
HD: V. Topic 3
p: something not to show up      because not in EXTRACT as TC
------------------------------
Siblings of: IV. Topic 2
p: something1      something else 1
p: something 2      something else 2
HD: V. Topic 3
p: something not to show up      because not in EXTRACT as TC
------------------------------
Siblings of: V. Topic 3
p: something not to show up      because not in EXTRACT as TC
------------------------------
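
The output above repeats every later section under each heading, because itersiblings() yields all following siblings, while the getnext() call in the question returns only the single next sibling (which is why data 4, data 5 and data 6 were missing). If the goal is only the content that belongs to each heading, one possible approach is to stop iterating at the next HD. The sketch below is not part of the original answer: the helper name sections_by_heading, the trimmed inline XML, and the idea of keeping only headings listed as TC entries are assumptions made for illustration.

```python
from lxml import etree

# Trimmed-down version of the XML from the question (assumption: inline
# string instead of demofile2.xml so the sketch is self-contained).
XML = """<BOOK>
<HD>Table of Contents</HD>
<EXTRACT>
<TC>II. Summary</TC>
</EXTRACT>
<HD>II. Summary</HD>
<p>
<FT>data 1</FT>
data 2
<FT>data 3</FT>
</p>
<p>
<FT>data 4</FT>
data 5
<FT>data 6</FT>
</p>
<HD>V. Topic 3</HD>
<p>something not to show up</p>
</BOOK>"""


def sections_by_heading(root):
    """Hypothetical helper: map each HD text to the normalized text of the
    siblings that follow it, stopping at the next HD."""
    sections = {}
    for hd in root.findall("HD"):
        texts = []
        for sibling in hd.itersiblings():
            if sibling.tag == "HD":
                break  # the next heading starts a new section
            # Join all descendant text, then collapse runs of whitespace.
            texts.append(" ".join("".join(sibling.itertext()).split()))
        sections[hd.text] = texts
    return sections


root = etree.fromstring(XML)
sections = sections_by_heading(root)

# Assumption: only headings that appear as TC entries should be shown.
wanted = {tc.text for tc in root.iter("TC")}
for heading, texts in sections.items():
    if heading in wanted:
        print(heading)
        for t in texts:
            print("  " + t)
```

With the trimmed sample, only the "II. Summary" section is printed, and both of its p elements are included. Dropping the `wanted` filter prints every section.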

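Note that you also need itertext to get all the text inside every tag. For example, some p tags contain nested tags. To get the text value of such a p tag, you have to apply itertext so that the inner text is included as well. You can see how this works in the line containing "".join(x.itertext()).strip().replace("\n", "").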
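
To make the difference concrete, here is a small sketch (not from the original answer) that compares .text, .tail and itertext() on one of the p elements from the question. The inline XML string and the variable names are assumptions chosen for illustration.

```python
from lxml import etree

# One <p> from the question, with text interleaved between child elements.
p = etree.fromstring("<p>\n<FT>data 1</FT>\ndata 2\n<FT>data 3</FT>\n</p>")

# p.text only covers what comes before the first child: here just a newline.
print(repr(p.text))

# "data 2" is not stored on p at all; it is the tail of the first FT child.
first_ft = p[0]
print(repr(first_ft.text), repr(first_ft.tail))

# itertext() walks the text and tails of all descendants in document order,
# so joining it recovers the full text; split/join collapses the whitespace.
full_text = " ".join("".join(p.itertext()).split())
print(full_text)
```

This is why printing only x.text and x.tail while iterating, as in the question, scatters the content across many lines, whereas joining itertext() gives one string per element.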
