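XPath: how to get the text of a node together with the text of its child nodes

I want an XPath that gets all of the text contained in a specific node and in its child nodes.

In the example below, I am trying to get: "Neil Carmichael (Stroud) (Con):"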

<p>
<a class="anchor" name="qn_o0"> </a>
<a class="anchor" name="160210-0001.htm_wqn0"> </a>
<a class="anchor" name="160210109000034"> </a>
1. <a class="anchor" name="160210109000555"> </a>
    <b><b>Neil Carmichael</b>
     "(Stroud) (Con):"
    </b>
    "What assessment he has made of the value to the economy in Scotland of UK membership of the single market. [903484]"
</p>

So far, with the following code, I have only managed to get one part or the other:

from lxml import html 
import requests 
page = requests.get('http://www.publications.parliament.uk/pa/cm201516/cmhansrd/cm160210/debtext/160210-0001.htm') 
tree = html.fromstring(page.content) 
test2 = tree.xpath('//div[@id="content-small"]/p[(a[@name[starts-with(.,"st_o")]] or a[@name[starts-with(.,"qn_")]])]/b/text()')

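Any help is welcome!

Stop the XPath at /b so that it returns the <b> elements rather than the text nodes inside <b>. You can then call text_content() on each element to get the expected text, for example: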

from lxml import html
raw = '''<p>
<a class="anchor" name="qn_o0"> </a>
<a class="anchor" name="160210-0001.htm_wqn0"> </a>
<a class="anchor" name="160210109000034"> </a>
1. <a class="anchor" name="160210109000555"> </a>
    <b><b>Neil Carmichael</b>
     "(Stroud) (Con):"
    </b>
    "What assessment he has made of the value to the economy in Scotland of UK membership of the single market. [903484]"
</p>'''
root = html.fromstring(raw)
result = root.xpath('//p/b')
print(repr(result[0].text_content()))  # repr() makes the newlines and spaces visible
# output :
# 'Neil Carmichael\n     "(Stroud) (Con):"\n    '
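
To illustrate why the original /b/text() only returned part of the string: text() selects only the text nodes that are direct children of the outer <b>, while the name sits inside the nested <b>. The following is a minimal sketch on a shortened, hypothetical snippet (the markup and the variable names are assumptions, not the real page), comparing /b/text(), /b//text() and text_content():

```python
from lxml import html

# Shortened, hypothetical stand-in for the paragraph in the question.
snippet = ('<p>1. <b><b>Neil Carmichael</b> (Stroud) (Con):</b> '
           'What assessment he has made.</p>')
p = html.fromstring(snippet)

# Direct text-node children of the outer <b> only: the name is missing,
# because it lives inside the nested <b>.
direct = p.xpath('./b/text()')

# All descendant text nodes of the outer <b>, in document order.
descendant = p.xpath('./b//text()')

# Joining the descendant text nodes gives the same string as text_content().
joined = ''.join(descendant)
full = p.xpath('./b')[0].text_content()

print(direct)
print(descendant)
print(joined == full)
```

So if you prefer to stay in pure XPath, /b//text() followed by a join is another option, at the cost of handling the pieces yourself.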

As an alternative to text_content(), you can use the XPath string() function, optionally combined with normalize-space():

print(result[0].xpath('string(normalize-space())'))
# output :
# Neil Carmichael "(Stroud) (Con):"
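
Applied to the query from the question, the same idea might look roughly like the sketch below. It runs against a small made-up document instead of the live page, so the sample markup, the second speaker and the variable names are all assumptions; only the div id and the starts-with() conditions come from the question.

```python
from lxml import html

# Made-up, trimmed-down stand-in for the real page (not actual Hansard content).
sample = '''<html><body><div id="content-small">
<p><a class="anchor" name="qn_o0"> </a>1. <b><b>Neil Carmichael</b>
 (Stroud) (Con):</b> What assessment he has made. [903484]</p>
<p><a class="anchor" name="st_o1"> </a><b>Speaker Two (Place) (Party):</b> A reply.</p>
<p><a class="anchor" name="other"> </a><b>Ignored</b> Not matched by the predicate.</p>
</div></body></html>'''
tree = html.fromstring(sample)

# Same predicate as in the question, but the path stops at /b
# so that elements are returned instead of text nodes.
speakers = tree.xpath(
    '//div[@id="content-small"]'
    '/p[a[starts-with(@name, "st_o")] or a[starts-with(@name, "qn_")]]/b'
)

# normalize-space() with no argument uses the string value of the context
# node, i.e. all descendant text, with runs of whitespace collapsed.
names = [b.xpath('normalize-space()') for b in speakers]
print(names)
```

Only the outer <b> is matched here, because the nested <b> is a child of <b>, not of <p>, so each speaker appears once.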
