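I need to use Nokogiri with CSS or XPath selectors to match text from the HTML below. Each match should start at the <div> tag with class="propsBar" and end at the closing side of the <div> tag with class="oddsInfoBottom". This should identify every occurrence of that pattern: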
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<input id="events[X2036-907-Yes-No-081414]" type="hidden" value="X2036-907-Yes-No-081414^No^Yes^Nationals (S Strasburg) @ Met…l there be a score in the 1st Inning?^8/14/2014^7:10 PM^2036" name="events[X2036-907-Yes-No-081414]"></input>
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<input id="events[X2036-915-Yes-No-081414]" type="hidden" value="X2036-915-Yes-No-081414^No^Yes^Astros (S Feldman) @ Red Sox …l there be a score in the 1st Inning?^8/14/2014^7:10 PM^2036" name="events[X2036-915-Yes-No-081414]"></input>
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<input id="events[X2036-917-Yes-No-081414]" type="hidden" value="X2036-917-Yes-No-081414^No^Yes^Rays (J Odorizzi) @ Rangers (…l there be a score in the 1st Inning?^8/14/2014^8:05 PM^2036" name="events[X2036-917-Yes-No-081414]"></input>
<div class="timeBar"></div>
The HTML above should return three matches.
So far, the only way I have been able to do this is:
one = html.xpath("//div[@class='propsBar']")
two = html.xpath("//div[@class='oddsInfoTop']")
three = html.xpath("//div[@class='oddsInfoBottom']")
one.zip(two, three).flatten.each_slice(3).map(&:join)
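The downside is that this returns only text rather than Nokogiri elements. I also think parsing this way is dangerous: if the page has different numbers of elements matching one, two, and three, it will break.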
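To illustrate that concern, here is a minimal sketch that uses plain string arrays in place of the extracted node text. The names and values are made up, and the second group is assumed to be missing its oddsInfoTop element. The chain does not raise; it silently pairs text from different groups.

```ruby
# Hypothetical text pulled from three groups; group 2 has no oddsInfoTop.
props   = %w[props1 props2 props3]
tops    = %w[top1 top3]
bottoms = %w[bottom1 bottom2 bottom3]

# Same chain as above: zip pads the short array with nil,
# and join renders nil as an empty string.
rows = props.zip(tops, bottoms).flatten.each_slice(3).map(&:join)

p rows
# => ["props1top1bottom1", "props2top3bottom2", "props3bottom3"]
```

The second row mixes props2 with top3, which is why walking siblings from each start tag is safer than zipping independent result sets.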
I'd write it like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<div class="timeBar"></div>
EOT
found_nodes = doc.search('div.propsBar').map { |node|
  nodes = [node]
  # Walk the sibling chain; stop at the closing marker or when the siblings run out.
  while (node = node.next_sibling)
    nodes << node
    break if node['class'] == 'oddsInfoBottom'
  end
  nodes
}
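(Note that I removed the <input> tags because they only clutter the input HTML. When you supply input data, please strip out all the noise.)
Running that returns the found nodes as an array of arrays. Each sub-array contains the individual nodes found by walking the sibling chain in order: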
require 'pp'
pp found_nodes
# >> [[#(Element:0x3ff00a4936a0 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a037c28 { name = "class", value = "propsBar" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a49363c {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a03629c { name = "class", value = "oddsInfoTop" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a4935b0 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a4668f8 { name = "class", value = "oddsInfoBottom" })]
# >> })],
# >> [#(Element:0x3ff00a49354c {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a45c808 { name = "class", value = "propsBar" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a4934e8 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a45b084 { name = "class", value = "oddsInfoTop" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a49345c {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a8710ec { name = "class", value = "oddsInfoBottom" })]
# >> })],
# >> [#(Element:0x3ff00a4933f8 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a4979d0 { name = "class", value = "propsBar" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a493394 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a47e188 { name = "class", value = "oddsInfoTop" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a493308 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a458f00 { name = "class", value = "oddsInfoBottom" })]
# >> })]]
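Remember that, after parsing, a document is a linked list of nodes. If the original XML or HTML contained a line break, there will be a Text node holding at least a newline character ("\n"). Because it is a list, we can move forward and backward with next_sibling and previous_sibling, respectively. That makes it really easy to grab small chunks, even when no block tag wraps the content you want.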
If you want the return value to look like the output of the search, css, or xpath methods, change the inner variable nodes from an Array to a NodeSet:
found_nodes = doc.search('div.propsBar').map { |node|
  nodes = Nokogiri::XML::NodeSet.new(doc, [node])
  # Walk the sibling chain; stop at the closing marker or when the siblings run out.
  while (node = node.next_sibling)
    nodes << node
    break if node['class'] == 'oddsInfoBottom'
  end
  nodes
}
require 'pp'
pp found_nodes.map(&:to_html)
Which results in:
# >> ["<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>",
# >> "<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>",
# >> "<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>"]
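Finally, notice that I used CSS selectors instead of XPath. I prefer them because they are usually more readable and succinct. XPath is more powerful and, because it was designed for parsing XML, it can often do the heavy lifting that we would otherwise have to do in Ruby after CSS selectors have only gotten us close to what we want. Use whichever gets the job done for you, while considering which is easier to read and maintain.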
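> I need to use Nokogiri with CSS selectors or XPath to match text from the HTML below. The match should start at the tag with class="propsBar" and end at the closing side of the tag with class="oddsInfoBottom".
But the matches are all the same, e.g.: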
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
require 'nokogiri'
doc = Nokogiri::HTML(File.read("xml3.xml"))
doc.css('div.propsBar').each do |div|
  puts div.to_html
  current_node = div
  while (current_node = current_node.next_element)
    puts current_node.to_html
    if current_node.has_attribute?('class')
      # \b anchors the name on word boundaries; the x flag ignores the spaces in the pattern
      if current_node['class'].match(/\b oddsInfoBottom \b/xm)
        puts "-" * 10
        break # Go get a new starting tag
      end
    end
  end
end
--output:--
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
> However, the downside is that this returns only text, no longer Nokogiri elements.
require 'nokogiri'
doc = Nokogiri::HTML(File.read("xml3.xml"))
groups = []
doc.css('div.propsBar').each do |tag|
  # Start a fresh group at every propsBar so an unterminated group cannot leak into the next one
  this_group = [tag]
  current_tag = tag
  while (current_tag = current_tag.next_element)
    this_group << current_tag
    if current_tag.has_attribute?('class')
      if current_tag['class'].match(/\b oddsInfoBottom \b/xm)
        groups << this_group
        break
      end
    end
  end
end
groups.each do |group|
  group.each do |tag|
    puts tag.to_html
  end
  puts '-' * 10
end
--output:--
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
----------
Use the + (adjacent sibling) combinator:
doc.search('.propsBar').each do |props_bar|
  odds_info_top = props_bar.at('+ .oddsInfoTop')
  next unless odds_info_top # at returns nil when the next element is not .oddsInfoTop
  puts props_bar.text, odds_info_top.text
end