Finding consecutive siblings with XPath

This should be an easy one for the XPath experts! :)

Document structure:

<tokens>
<token>
<word>Newt</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Gingrich</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>admires</word><entityType>VERB</entityType>
</token>
<token>
<word>Garry</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Trudeau</word><entityType>PROPER_NOUN</entityType>
</token>
</tokens>

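Ignoring the semantic improbability of the document, I want to pull out [["Newt", "Gingrich"], ["Garry", "Trudeau"]]. That is, when two tokens in a row have the entityType PROPER_NOUN, I want to extract the words from those two tokens.

This is as far as I've gotten: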

"//token[entityType='PROPER_NOUN']/following-sibling::token[1][entityType='PROPER_NOUN']"

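This finds the second of two consecutive PROPER_NOUN tokens, but I'm not sure how to get it to emit the first token as well.

Some notes:

I don't mind doing higher-level processing on the NodeSet (e.g. in Ruby/Nokogiri) if that simplifies the problem.
If there are three or more consecutive PROPER_NOUN tokens (call them A, B, C), ideally I'd like to emit [A, B], [B, C].

Update

Here is my solution using higher-level Ruby functions. But I'm tired of all those XPath bullies kicking sand in my face, and I want to know how real XPath programmers do it!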

def extract(doc)
  names = []
  sentences = doc.xpath("//tokens")
  sentences.each do |sentence|
    tokens = sentence.xpath("token")
    prev = nil
    tokens.each do |token|
      name = token.xpath("word").text if token.xpath("entityType").text == "PROPER_NOUN"
      names << [prev, name] if (name && prev)
      prev = name
    end
  end
  names
end
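
For comparison, the same sliding-window idea can be sketched with each_cons(2), which looks at every pair of neighbouring tokens. This is only a sketch under a few assumptions: it uses Ruby's bundled REXML instead of Nokogiri so that it runs without extra gems, and the helper name adjacent_proper_noun_pairs and the inline sample document are made up for illustration.

```ruby
require "rexml/document"

# Hypothetical helper (not from the original post): returns every pair of
# adjacent PROPER_NOUN words, so a run A, B, C yields [A, B] and [B, C].
def adjacent_proper_noun_pairs(xml)
  doc = REXML::Document.new(xml)
  pairs = []
  REXML::XPath.match(doc, "//tokens").each do |sentence|
    # Reduce each <token> child to [word, entityType].
    tokens = sentence.get_elements("token").map do |token|
      [token.elements["word"].text, token.elements["entityType"].text]
    end
    # Slide a window of two over neighbouring tokens.
    tokens.each_cons(2) do |(word_a, type_a), (word_b, type_b)|
      pairs << [word_a, word_b] if type_a == "PROPER_NOUN" && type_b == "PROPER_NOUN"
    end
  end
  pairs
end

# Sample input with a run of three proper nouns (illustrative data).
words = [%w[Newt PROPER_NOUN], %w[Gingrich PROPER_NOUN], %w[Rep PROPER_NOUN],
         %w[admires VERB], %w[Garry PROPER_NOUN], %w[Trudeau PROPER_NOUN]]
sample = "<tokens>" + words.map { |word, type|
  "<token><word>#{word}</word><entityType>#{type}</entityType></token>"
}.join + "</tokens>"

p adjacent_proper_noun_pairs(sample)
# => [["Newt", "Gingrich"], ["Gingrich", "Rep"], ["Garry", "Trudeau"]]
```

Because the window always covers two neighbouring tokens, a VERB between two proper nouns breaks the pair without any extra bookkeeping.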

I would do it in two steps. The first step is to select a set of nodes:

//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]

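This gives you all the token elements that start a 2-word pair. Then, to get the actual pairs, iterate over the node list and extract ./word and following-sibling::token[1]/word.

Using XmlStarlet (http://xmlstar.sourceforge.net/ - an excellent tool for quick XML manipulation), the command line is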

xml sel -t -m "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]" -v word -o "," -v "following-sibling::token[1]/word" -n /tmp/tok.xml

which gives

Newt,Gingrich
Garry,Trudeau

XmlStarlet also compiles that command line to XSLT; the relevant part is

<xsl:for-each select="//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]">
<xsl:value-of select="word"/>
<xsl:value-of select="','"/>
<xsl:value-of select="following-sibling::token[1]/word"/>
<xsl:value-of select="'&#10;'"/>
</xsl:for-each>

With Nokogiri, it could look like this:

require 'nokogiri'

# parse the document (the_document_string holds the XML text)
doc = Nokogiri::XML(the_document_string)
# select all tokens that start a 2-word pair
pair_starts = doc.xpath('//token[entityType = "PROPER_NOUN" and following-sibling::token[1][entityType = "PROPER_NOUN"]]')
# extract each word and the following one
result = pair_starts.each_with_object([]) do |node, array|
  array << [node.at_xpath('word').text, node.at_xpath('following-sibling::token[1]/word').text]
end

The XPath 1.0 expression:

/*/token
[entityType='PROPER_NOUN'
and
following-sibling::token[1]/entityType = 'PROPER_NOUN'
]
/word

selects all "first noun in a pair" words.

This XPath expression:

/*/token
[entityType='PROPER_NOUN'
and
preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
]
/word

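selects all "second noun in a pair" words.

You will have to produce the actual pairs yourself, by taking the k-th node from each of the two resulting node sets.

XSLT-based verification: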

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"/*/token
[entityType='PROPER_NOUN'
and
following-sibling::token[1]/entityType = 'PROPER_NOUN'
]
/word
"/>
==============
<xsl:copy-of select=
"/*/token
[entityType='PROPER_NOUN'
and
preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
]
/word
"/>
</xsl:template>
</xsl:stylesheet>

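This transformation simply evaluates the two XPath expressions and outputs the results of the two evaluations (using a suitable delimiter to show where the first result ends and the second begins).

When applied to the provided XML document: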

<tokens>
<token>
<word>Newt</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Gingrich</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>admires</word><entityType>VERB</entityType>
</token>
<token>
<word>Garry</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Trudeau</word><entityType>PROPER_NOUN</entityType>
</token>
</tokens>

the output is:

<word>Newt</word>
<word>Garry</word>
==============
<word>Gingrich</word>
<word>Trudeau</word>

Combining (zipping) the two results, which you would do in your favorite PL, gives:

["Newt", "Gingrich"]

and

["Garry", "Trudeau"]

When the same transformation is applied to this XML document (note that we now have a triple):

<tokens>
<token>
<word>Newt</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Gingrich</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Rep</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>admires</word><entityType>VERB</entityType>
</token>
<token>
<word>Garry</word><entityType>PROPER_NOUN</entityType>
</token>
<token>
<word>Trudeau</word><entityType>PROPER_NOUN</entityType>
</token>
</tokens>

the result is now:

<word>Newt</word>
<word>Gingrich</word>
<word>Garry</word>
==============
<word>Gingrich</word>
<word>Rep</word>
<word>Trudeau</word>

Zipping the two results produces the correct, wanted final result:

["Newt", "Gingrich"],
["Gingrich", "Rep"],

and

["Garry", "Trudeau"]
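
The zipping step itself might look like the Ruby sketch below. The word lists are copied from the transformation output shown above; the helper name zip_pairs is an assumption made for illustration. Zipping lines up correctly because every "first" word has exactly one matching "second" word (the word of the token right after it) and both node sets come back in document order.

```ruby
# Hypothetical helper: pairs the k-th "first" word with the k-th "second" word.
def zip_pairs(firsts, seconds)
  # Both lists must have the same length, otherwise zip would pad with nil.
  raise ArgumentError, "result lists differ in length" unless firsts.size == seconds.size
  firsts.zip(seconds)
end

# Word lists copied from the output of the transformation on the triple document.
firsts  = %w[Newt Gingrich Garry]
seconds = %w[Gingrich Rep Trudeau]

p zip_pairs(firsts, seconds)
# => [["Newt", "Gingrich"], ["Gingrich", "Rep"], ["Garry", "Trudeau"]]
```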

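Do note:

The wanted result can be produced with a single XPath 2.0 expression. Let me know if you are interested in an XPath 2.0 solution.

XPath returns a node or a node set, but not groups. So you have to identify the start of each group and then grab the rest.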

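# words that start a pair
first = "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]/word"
# relative path from a pair's first <word> to the word of the next token
following = "../following-sibling::token[1]/word"
doc.xpath(first).map { |word| [word.text, word.xpath(following).text] }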

Output:

[["Newt", "Gingrich"], ["Garry", "Trudeau"]]

XPath alone isn't sufficient for this task. But it's easy in XSLT 2.0:

<xsl:for-each-group select="token" group-adjacent="entityType">
<xsl:if test="current-grouping-key() = 'PROPER_NOUN'">
<xsl:copy-of select="current-group()"/>
<xsl:text>====</xsl:text>
</xsl:if>
</xsl:for-each-group>
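
For readers without an XSLT 2.0 processor, a rough Ruby analogue of group-adjacent is Enumerable#chunk, which also groups consecutive items that share a key. The sketch below is an illustration only: it assumes Ruby's bundled REXML, a single tokens root element as in the question, and the made-up helper names proper_noun_runs and pairs_from_runs.

```ruby
require "rexml/document"

# Hypothetical helper: groups adjacent <token> elements by entityType and
# returns the words of each PROPER_NOUN run (runs of length one included).
def proper_noun_runs(xml)
  doc = REXML::Document.new(xml)
  tokens = REXML::XPath.match(doc, "/tokens/token")
  tokens
    .chunk { |token| token.elements["entityType"].text }
    .select { |type, _group| type == "PROPER_NOUN" }
    .map { |_type, group| group.map { |token| token.elements["word"].text } }
end

# Hypothetical helper: turns each run into overlapping pairs, [A, B], [B, C].
def pairs_from_runs(runs)
  runs.flat_map { |run| run.each_cons(2).to_a }
end

# Sample input with a run of three proper nouns (illustrative data).
words = [%w[Newt PROPER_NOUN], %w[Gingrich PROPER_NOUN], %w[Rep PROPER_NOUN],
         %w[admires VERB], %w[Garry PROPER_NOUN], %w[Trudeau PROPER_NOUN]]
sample = "<tokens>" + words.map { |word, type|
  "<token><word>#{word}</word><entityType>#{type}</entityType></token>"
}.join + "</tokens>"

runs = proper_noun_runs(sample)
p runs
# => [["Newt", "Gingrich", "Rep"], ["Garry", "Trudeau"]]
p pairs_from_runs(runs)
# => [["Newt", "Gingrich"], ["Gingrich", "Rep"], ["Garry", "Trudeau"]]
```

Like the stylesheet, the grouping step keeps a lone proper noun as a group of one; it only disappears in the pairing step, where each_cons(2) yields nothing for a single word.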
