XPath node-set selection by nesting order



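Is there an XPath 1.0 expression that, starting from the div[@id='rootTag'] context, selects nested span descendants according to their nesting depth? For example, could something like span[2] select the second most deeply nested span tag, rather than the second span child of the same parent element?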

<div id='rootTag'>
<span>Test</span>
<div>   
<span>Test</span>
<span>Test</span>
</div>
</div>  
<span>Test</span>
</div>
<div>  
<div>
<div>  
<div>
<span>Test</span>
</div>
<span>Test</span>
</div>
</div>
</div>
</div>

This is a bit of a hack (OK, a lot of one), but it can be done.

Assume your HTML looks like this:

levels = """<div id='rootTag'>
<span>Level2</span>
<div>   
<span>Level3</span>
<div>
<span>Level4</span>
</div>
</div>
<div>  
<span>Level3</span>
</div>
<div>  
<div>
<div>  
<div>
<span>Level6</span>
</div>
<span>Level5</span>
</div>
</div>
</div>
</div>"""

Then we do this:

# First collect the data.
# Note: the HTML must be well-formed, or etree.fromstring will fail.
from lxml import etree
root = etree.fromstring(levels)
tree = etree.ElementTree(root)
# Collect the paths of all <span> elements
paths = [tree.getpath(e) for e in root.iter('span')]
# Determine the nesting level of each <span> element (number of '/' in its path)
nests = [p.count('/') for p in paths]
# or, alternatively:
# nests = [tree.getpath(e).count('/') for e in root.iter('span')]

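From here, we use the nesting levels in the nests list to pick the corresponding entries in the paths list. For example, to get the most deeply nested <span> element: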

deepest = nests.index(max(nests))
print(paths[deepest],root.xpath(paths[deepest])[0].text)

Output:

/div/div[3]/div/div/div/span Level6
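The question asks for something like "the second-deepest span", so the same idea can be generalized by ranking the distinct depths. The following is only a sketch: the helper name spans_by_depth_rank and the shortened sample document are assumptions made for illustration, not part of the original answer.

```python
from lxml import etree

# Shortened sample document, for illustration only
doc = """<div id='rootTag'>
<span>Level2</span>
<div><span>Level3</span><div><span>Level4</span></div></div>
<div><span>Level3</span></div>
</div>"""

def spans_by_depth_rank(root, rank):
    """Return the <span> elements on the rank-th deepest level (1 = deepest)."""
    tree = etree.ElementTree(root)
    spans = list(root.iter('span'))
    # Nesting level = number of '/' in the element's path, as above
    depths = [tree.getpath(e).count('/') for e in spans]
    # Distinct depths, deepest first
    levels = sorted(set(depths), reverse=True)
    if not 1 <= rank <= len(levels):
        return []
    return [e for e, d in zip(spans, depths) if d == levels[rank - 1]]

root = etree.fromstring(doc)
print([e.text for e in spans_by_depth_rank(root, 1)])  # ['Level4']
print([e.text for e in spans_by_depth_rank(root, 2)])  # ['Level3', 'Level3']
```

Unlike nests.index(...), which stops at the first match, this returns every span that shares the requested depth.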

Or, to extract the <span> element at nesting level 4:

print(paths[nests.index(4)],root.xpath(paths[nests.index(4)])[0].text)

Output:

/div/div[1]/div/span Level4
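If the depth is known in advance, XPath 1.0 itself can filter on it with count(ancestor::*), which avoids building the path list. This is a minimal sketch under stated assumptions: the helper name spans_at_relative_depth and the shortened sample document are hypothetical, and the depth is measured relative to the context node by passing a variable through lxml's xpath() keyword arguments. It does not select "the deepest" level on its own, since that still requires knowing the maximum depth.

```python
from lxml import etree

# Shortened sample document, for illustration only
doc = """<div id='rootTag'>
<span>Level2</span>
<div><span>Level3</span><div><span>Level4</span></div></div>
<div><span>Level3</span></div>
</div>"""

root = etree.fromstring(doc)
# Context node: the div with id='rootTag' (here it is also the document root)
ctx = root.xpath("//div[@id='rootTag']")[0]

def spans_at_relative_depth(ctx, k):
    """Return the spans nested k levels below ctx (k=1: direct children)."""
    # Number of elements from the document root down to and including ctx
    base = ctx.xpath("count(ancestor-or-self::*)")
    # A span k levels below ctx has exactly base + k - 1 element ancestors
    return ctx.xpath(".//span[count(ancestor::*) = $n]", n=base + k - 1)

print([e.text for e in spans_at_relative_depth(ctx, 1)])  # ['Level2']
print([e.text for e in spans_at_relative_depth(ctx, 2)])  # ['Level3', 'Level3']
print([e.text for e in spans_at_relative_depth(ctx, 3)])  # ['Level4']
```

Note that count(ancestor::*) is one less than the slash count used above, because the path string also counts the span itself.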
