XPath: all nodes up to a given node (Wikiquote.org)



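Document: http://en.wikiquote.org/wiki/The_Matrix

I want to get all the quotes (//ul/li) of the first section (Neo's quotes).

I can't use //ul[1]/li because on some Wikiquote pages the quotes are laid out like this: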

<h2><span class="mw-headline" id="Neo">Neo</span></h2>  
<ul>
<li> First quote </li>
</ul> 
<ul>
<li> Second quote </li>
</ul> 
<h2><span class="mw-headline" id="dont wanna this">Useless</span></h2>  

instead of

<ul>
<li> First quote </li>
<li> Second quote </li>
</ul>

I have tried this to get the first section:

(//*[@id='mw-content-text']/ul/preceding-sibling::h2/span[@class='mw-headline'])[1]

But I can't get only the quotes of the first section. Can you help me?

Use:

(//h2[span/@id='Neo'])[1]/following-sibling::ul
[count(.
|
(//h2[span/@id='Neo'])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
=
count((//h2[span/@id='Neo'])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
]
/li

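This selects every li child of the ul elements that follow the first h2 whose span child has an id attribute with the value "Neo" and that precede the next h2.

To select the quotations for the second such h2, replace the index 1 in (//h2[span/@id='Neo'])[1] with 2 everywhere in the expression above. The [1] in following-sibling::h2[1] stays unchanged.

Do the same for every index: 1, 2, ..., count(//h2[span/@id='Neo'])

XSLT-based verification: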

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"(//h2[span/@id='Neo'])[1]/following-sibling::ul
[count(.
|
(//h2[span/@id='Neo'])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
=
count((//h2[span/@id='Neo'])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
]
/li
"/>
</xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document:

<html>
<h2><span class="mw-headline" id="Neo">Neo</span></h2>
<ul>
<li> First quote </li>
</ul>
<ul>
<li> Second quote </li>
</ul>
<h2><span class="mw-headline" id="dont wanna this">Useless</span></h2>
</html>

the XPath expression is evaluated and the selected nodes are copied to the output:

<li> First quote </li>
<li> Second quote </li>

Explanation:

This follows from the Kayessian formula (named after Dr. Michael Kay) for the intersection of two node-sets:

$ns1[count(.|$ns2) = count($ns2)]

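The expression above selects exactly the nodes that belong to both node-set $ns1 and node-set $ns2. A node from $ns1 is already in $ns2 precisely when adding it to $ns2 does not change the size of $ns2.

So we substitute for $ns1 the node-set of all ul following siblings of the h2 of interest. For $ns2 we substitute the node-set of all ul preceding siblings of the h2 that is the immediate (first) following-sibling h2 of the h2 of interest.

The intersection of these two node-sets contains exactly the wanted ul elements.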
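The counting trick can be mimicked outside XPath. The following is only a minimal sketch in Ruby, with plain strings standing in for nodes; the names siblings and intersect are made up for illustration and are not part of any XPath library.

```ruby
# Plain strings stand in for sibling elements, listed in document order.
siblings = %w[h2_neo ul_1 ul_2 h2_useless ul_3]

# Kayessian intersection: keep a node of ns1 only if adding it to ns2
# does not make ns2 any larger, i.e. count(. | $ns2) = count($ns2).
def intersect(ns1, ns2)
  ns1.select { |node| (ns2 | [node]).size == ns2.size }
end

start = siblings.index('h2_neo')      # the h2 of interest
stop  = siblings.index('h2_useless')  # the next h2

# $ns1: every ul after the h2 of interest.
ns1 = siblings[(start + 1)..-1].select { |n| n.start_with?('ul') }
# $ns2: every ul before the next h2.
ns2 = siblings[0...stop].select { |n| n.start_with?('ul') }

p intersect(ns1, ns2)
```

Here ul_3 is dropped because it lies after the next h2, so it is in $ns1 but not in $ns2.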


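Update: In a comment the OP has stated that he only knows he wants the results from the first section; the string "Neo" is not known in advance.

Here is the modified solution: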

(//h2[span/@id=$vSectionId])[1]
/following-sibling::ul
[count(.
|
(//h2[span/@id=$vSectionId])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
=
count((//h2[span/@id=$vSectionId])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
]
/li

The variable $vSectionId must be obtained as the string value of the following XPath expression:

substring(//div[h2='Contents']
/following-sibling::ul[1]
/li[1]/a/@href,
2)

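Here we take the wanted id from the href attribute of the a element in the first table-of-contents entry and skip its first character, "#". XPath string positions are 1-based, so the substring starts at position 2.

Here again is an XSLT-based verification: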
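For readers more used to 0-based string indexing, here is a small Ruby sketch of the same step. The helper name section_id_from_href and the sample href value are assumptions made for illustration.

```ruby
# XPath's substring(@href, 2) starts at the 2nd character (1-based).
# In Ruby, which is 0-based, the same slice starts at index 1.
def section_id_from_href(href)
  href[1..-1]
end

# Hypothetical href, shaped like a table-of-contents link.
p section_id_from_href('#Neo')
```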

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:variable name="vSectionId" select=
"substring(//div[h2='Contents']
/following-sibling::ul[1]
/li[1]/a/@href,
2)
"/>
<xsl:template match="/">
<xsl:copy-of select=
"(//h2[span/@id=$vSectionId])[1]
/following-sibling::ul
[count(.
|
(//h2[span/@id=$vSectionId])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
=
count((//h2[span/@id=$vSectionId])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
]
/li
"/>
</xsl:template>
</xsl:stylesheet>

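When this transformation is applied to the complete XML document at http://en.wikiquote.org/wiki/The_Matrix, the result of applying the two XPath expressions (substituting the result of the first expression into the second, then evaluating the second) is the wanted, correct result: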

<li>I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.</li>
<li>Whoa.</li>
<li>I know kung-fu.</li>
<li>Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.</li>
<li>Guns.. lots of guns...</li>
<li>There is no spoon.</li>
<li>My name...is Neo!</li>

Using the API will make parsing much easier. Here is a query that pulls the first section:

http://en.wikiquote.org/w/api.php?action=parse&page=The_Matrix&section=1&prop=wikitext

Output:

<?xml version="1.0"?>
<api>
<parse title="The Matrix">
<wikitext xml:space="preserve">== Neo ==
[[File:The.Matrix.glmatrix.2.png|thumb|right|Unfortunately, no one can be ''told'' what The Matrix is. You have to see it for yourself.]]
[[Image:Arty spoon.jpg|thumb|right|Do not try to bend the spoon — that's impossible. Instead, only try to realize the truth: there is no spoon.]]
* I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.
* Whoa.
* I know kung-fu.
* Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.
* Guns.. lots of guns...
* There is no spoon. 
* My name...is Neo!</wikitext>
</parse>
</api>

Here is one way to parse it (using HTTParty):

require 'httparty'

class Wikiquote
  include HTTParty
  base_uri 'en.wikiquote.org/w/'

  # Returns the quotes (wikitext bullet lines) from section 1 of the given page.
  def self.get_quotes(page)
    url = "/api.php?action=parse&page=#{page}&section=1&prop=wikitext&format=xml"
    headers = { "User-Agent" => "Wikiquote scraper 1.0" }
    content = get(url, headers: headers)['api']['parse']['wikitext']['__content__']
    # Each quote is a line that starts with "* "; the asterisk must be escaped.
    content.scan(/^\* (.*)$/).flatten
  end
end

Usage:

Wikiquote.get_quotes("The_Matrix")

Output:

["I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.",
"Whoa.",
"I know kung-fu.",
"Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.",
"Guns.. lots of guns...",
"There is no spoon. ",
"My name...is Neo!"]
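The extraction step can be tried without any network access. The sketch below runs the same kind of regular expression over a short, made-up wikitext sample; the constant WIKITEXT, the file name in it, and the helper extract_quotes are illustrative assumptions, not part of HTTParty or the MediaWiki API.

```ruby
# Abbreviated, hypothetical wikitext in the shape returned by the API.
WIKITEXT = <<~TEXT
  == Neo ==
  [[File:Example.png|thumb|right|A caption, not a quote.]]
  * Whoa.
  * I know kung-fu.
  * There is no spoon.
TEXT

# Collect the text of every line that starts with "* ".
# The asterisk is escaped so that it matches literally.
def extract_quotes(wikitext)
  wikitext.scan(/^\* (.*)$/).flatten
end

p extract_quotes(WIKITEXT)
```

The heading and the image line are skipped because neither starts with "* ".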

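I suggest //ul[preceding-sibling::h2[1][span/@id = 'Neo']]/li. Or, if that id attribute is not present or not relevant to the search, then, based on your comment on the other answer, I think you want: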

(//h2[span[contains(@class, 'mw-headline')]])[1]/following-sibling::ul
[1 = count(preceding-sibling::h2[1] | (//h2[span[contains(@class, 'mw-headline')]])[1])]/li

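See "XPath axis, get all following nodes until" for an explanation. I hope I have closed all the brackets and braces correctly; I have no time to test it right now.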
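The idea behind //ul[preceding-sibling::h2[1][span/@id = 'Neo']]/li is that a ul belongs to the section whose h2 is its nearest preceding h2 sibling. Below is a rough procedural sketch of that rule in Ruby. It assumes the rexml library is available (it ships with standard Ruby installations as a bundled gem), walks the child elements directly instead of evaluating the XPath expression, and only looks at children of the root element. SAMPLE and section_quotes are hypothetical names, and a real Wikiquote page would need an HTML parser rather than an XML one.

```ruby
require 'rexml/document'

# Well-formed sample modelled on the markup in the question,
# with one extra section added for contrast.
SAMPLE = <<~XML
  <html>
    <h2><span class="mw-headline" id="Neo">Neo</span></h2>
    <ul><li> First quote </li></ul>
    <ul><li> Second quote </li></ul>
    <h2><span class="mw-headline" id="Useless">Useless</span></h2>
    <ul><li> Unwanted quote </li></ul>
  </html>
XML

# Collect the li texts of every ul whose nearest preceding h2 sibling
# has a span child with the given id.
def section_quotes(xml, section_id)
  root = REXML::Document.new(xml).root
  quotes = []
  in_section = false
  root.elements.each do |el|
    if el.name == 'h2'
      # Every h2 starts a new section; remember whether it is the wanted one.
      span = el.elements['span']
      in_section = !span.nil? && span.attributes['id'] == section_id
    elsif in_section && el.name == 'ul'
      el.elements.each('li') { |li| quotes << li.text.strip }
    end
  end
  quotes
end

p section_quotes(SAMPLE, 'Neo')
```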
