How to match consecutive nodes with Nokogiri

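I need to use Nokogiri with CSS or XPath selectors to match text from the HTML below. Each match should start at the <div> tag with class="propsBar" and end at the closing tag of the <div> with class="oddsInfoBottom". This should identify every occurrence of that pattern: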

<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
 BUY POINTS 
-->
<input id="events[X2036-907-Yes-No-081414]" type="hidden" value="X2036-907-Yes-No-081414^No^Yes^Nationals (S Strasburg) @ Met…l there be a score in the 1st Inning?^8/14/2014^7:10 PM^2036" name="events[X2036-907-Yes-No-081414]"></input>
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
 BUY POINTS 
-->
<input id="events[X2036-915-Yes-No-081414]" type="hidden" value="X2036-915-Yes-No-081414^No^Yes^Astros (S Feldman) @ Red Sox …l there be a score in the 1st Inning?^8/14/2014^7:10 PM^2036" name="events[X2036-915-Yes-No-081414]"></input>
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
 BUY POINTS 
-->
<input id="events[X2036-917-Yes-No-081414]" type="hidden" value="X2036-917-Yes-No-081414^No^Yes^Rays (J Odorizzi) @ Rangers (…l there be a score in the 1st Inning?^8/14/2014^8:05 PM^2036" name="events[X2036-917-Yes-No-081414]"></input>
<div class="timeBar"></div>

The HTML above should return three matches.

So far, the only way I have managed to do this is:

one = html.xpath("//div[@class='propsBar']")
two = html.xpath("//div[@class='oddsInfoTop']")
three = html.xpath("//div[@class='oddsInfoBottom']")
one.zip(two, three).flatten.each_slice(3).map(&:join)

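The drawback is that this returns only text, not Nokogiri elements. I also think parsing this way is fragile: if the page contains different numbers of elements matching one, two, and three, it will break.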

I'd write it like this:

require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
 BUY POINTS
-->
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
 BUY POINTS
-->
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
 BUY POINTS
-->
<div class="timeBar"></div>
EOT
found_nodes = doc.search('div.propsBar').map{ |node|
  nodes = [node]
  loop do
    node = node.next_sibling
    nodes << node
    break if node['class'] == 'oddsInfoBottom'
  end
  nodes
}

(Note that I removed the <input> tags because they only clutter the input HTML. When you supply input data, please strip out all the noise.)

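Running that returns the found nodes as an array of arrays. Each sub-array contains the individual nodes found by walking the sibling chain in order: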

require 'pp'
pp found_nodes
# >> [[#(Element:0x3ff00a4936a0 {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a037c28 { name = "class", value = "propsBar" })]
# >>     }),
# >>   #(Text "\n"),
# >>   #(Element:0x3ff00a49363c {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a03629c { name = "class", value = "oddsInfoTop" })]
# >>     }),
# >>   #(Text "\n"),
# >>   #(Element:0x3ff00a4935b0 {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a4668f8 { name = "class", value = "oddsInfoBottom" })]
# >>     })],
# >>  [#(Element:0x3ff00a49354c {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a45c808 { name = "class", value = "propsBar" })]
# >>     }),
# >>   #(Text "\n"),
# >>   #(Element:0x3ff00a4934e8 {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a45b084 { name = "class", value = "oddsInfoTop" })]
# >>     }),
# >>   #(Text "\n"),
# >>   #(Element:0x3ff00a49345c {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a8710ec { name = "class", value = "oddsInfoBottom" })]
# >>     })],
# >>  [#(Element:0x3ff00a4933f8 {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a4979d0 { name = "class", value = "propsBar" })]
# >>     }),
# >>   #(Text "\n"),
# >>   #(Element:0x3ff00a493394 {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a47e188 { name = "class", value = "oddsInfoTop" })]
# >>     }),
# >>   #(Text "\n"),
# >>   #(Element:0x3ff00a493308 {
# >>     name = "div",
# >>     attributes = [
# >>       #(Attr:0x3ff00a458f00 { name = "class", value = "oddsInfoBottom" })]
# >>     })]]

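Remember that, after parsing, the document is a linked list of nodes. Wherever the original XML or HTML contains a line break, there will be a Text node containing at least a newline character ("\n"). Because it is a list, we can move forward and backward using next_sibling and previous_sibling, respectively. That makes it really easy to grab small chunks, even when no block tag wraps the content you want.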

If you want the return value to resemble the output of the search, css, or xpath methods, change the inner variable nodes from an Array to a NodeSet:

found_nodes = doc.search('div.propsBar').map{ |node|
  nodes = Nokogiri::XML::NodeSet.new(doc, [node])
  loop do
    node = node.next_sibling
    nodes << node
    break if node['class'] == 'oddsInfoBottom'
  end
  nodes
}
require 'pp'
pp found_nodes.map(&:to_html)

Which outputs:

# >> ["<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>",
# >>  "<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>",
# >>  "<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>"]

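Finally, note that I used a CSS selector rather than XPath. I prefer CSS selectors because they are usually more readable and concise. XPath is more powerful and, because it was designed for parsing XML, it can often do all the heavy lifting that we would otherwise have to do in Ruby after a CSS selector only gets us close to what we want. Use whichever gets the job done for you, while weighing which is easier to read and maintain.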

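I need to use Nokogiri with CSS selectors or XPath to match text from the HTML below. It should start matching at the tag with class="propsBar" and end the match at the closing side of the tag where class="oddsInfoBottom".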

But they are all the same, for example:

<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>

require 'nokogiri'
doc = Nokogiri::HTML(File.read("xml3.xml"))
doc.css('div.propsBar').each do |div|
  puts div.to_html
  current_node = div
  while current_node = current_node.next_element
    puts current_node.to_html
    if current_node.has_attribute?'class'
      if current_node['class'].match(/\b oddsInfoBottom \b/xm)
        puts "-" * 10
        break  #Go get a new starting tag
      end
    end
  end
end
--output:--
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------

The drawback, however, is that this returns only text, not Nokogiri elements.

require 'nokogiri'
doc = Nokogiri::HTML(File.read("xml3.xml"))
groups = []
this_group = []
doc.css('div.propsBar').each do |tag|
  this_group << tag
  current_tag = tag
  while current_tag = current_tag.next_element
    this_group << current_tag
    if current_tag.has_attribute?'class'
      if current_tag['class'].match(/\b oddsInfoBottom \b/xm)
        groups << this_group
        this_group = []
        break
      end
    end
  end
end

groups.each do |group|
  group.each do |tag|
    puts tag.to_html
  end
  puts '-' * 10
end
--output:--
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------

Use the + (adjacent sibling) combinator:

doc.search('.propsBar').each do |props_bar|
  odds_info_top = props_bar.at('+ .oddsInfoTop')
  puts props_bar.text, odds_info_top.text
end
