Python XPath: get the value from the root element only

I'm using XPath to scrape a web page, but I'm having trouble with one part of the markup:

<div class="description">
   here's the page description
   <span> some other text</span>
   <span> another tag </span>
</div>

I'm using this code to get the value from the element:

description = tree.xpath('//div[@class="description"]/text()')

It finds the correct div, but I only want the text "here's the page description", not the content of the inner span tags.

Does anyone know how I can get only the text in the root node, without the content of its child nodes?

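The expression you are currently using actually matches only the top-level text child nodes of the div, so the text inside the span tags is already excluded. You can wrap it in normalize-space() to strip the extra newlines and spaces from the text: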

>>> from lxml.html import fromstring
>>> data = """
... <div class="description">
...    here's the page description
...    <span> some other text</span>
...    <span> another tag </span>
... </div>
... """
>>> root = fromstring(data)
>>> root.xpath('normalize-space(//div[@class="description"]/text())')
"here's the page description"
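To see what normalize-space() is cleaning up, it may help to look at what the bare text() expression returns. The following is a minimal sketch, assuming the same markup as above; the names raw_texts, clean_texts and first_text are illustrative only and not part of the original answer.

```python
from lxml.html import fromstring

data = """
<div class="description">
   here's the page description
   <span> some other text</span>
   <span> another tag </span>
</div>
"""
root = fromstring(data)

# text() selects every direct text child of the div. The whitespace
# around the spans may show up as extra, blank entries in this list.
raw_texts = root.xpath('//div[@class="description"]/text()')

# Keep only the entries that contain something other than whitespace.
clean_texts = [t.strip() for t in raw_texts if t.strip()]
print(clean_texts)

# text()[1] restricts the match to the first text child, which is the
# node normalize-space() converts to a string when given the whole set.
first_text = root.xpath(
    'normalize-space(//div[@class="description"]/text()[1])'
)
print(first_text)
```

Both prints should contain only the description text. The list-based variant is the one to reach for if the div could hold meaningful text after one of its child elements, since normalize-space() on a node-set only looks at the first node.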

To get the complete text of a node, including its child nodes, use the .text_content() method:

node = tree.xpath('//div[@class="description"]')[0]
print(node.text_content())
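For comparison, here is a small sketch that puts the two approaches side by side on the same markup. It assumes the element is parsed with lxml.html as in the answer; own_text and full_text are hypothetical names chosen for this example.

```python
from lxml.html import fromstring

data = """
<div class="description">
   here's the page description
   <span> some other text</span>
   <span> another tag </span>
</div>
"""
root = fromstring(data)
node = root.xpath('//div[@class="description"]')[0]

# .text holds only the text that precedes the first child element.
own_text = node.text.strip()

# .text_content() concatenates the text of the node and all of its
# descendants; split/join collapses the runs of whitespace.
full_text = " ".join(node.text_content().split())

print(own_text)
print(full_text)
```

Here own_text is limited to the div's own leading text, while full_text also includes the text of both span elements.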
