I want to scrape an HTML file like this:
<div id="hoge">
<h1><span>title 1</span></h1>
<h2><span>subtitle 1-1</span></h2>
<p></p>
<table class="fuga"><span>data 1-1</span></table>
<p></p>
//(the same structure repeated n times)
<h2><span>subtitle 1-(n+2)<span/></h2>
<p></p>
<table class="fuga"><span>data 1-(n+2)</span></table>
<p></p>
//(the same structure repeated m times)
<h1><span>title m</span></h1>
<h2><span>subtitle m-1</span></h2>
<p></p>
<table class="fuga"><span>data m-1</span></table>
<p></p>
//(the same structure repeated l times)
<h2><span>subtitle m-(l+2)</span></h2>
<p></p>
<table class="fuga"><span>data m-(l+2)</span></table>
<p></p>
</div>
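I need the table value (shown in the example as data x-y) for each subtitle ("subtitle x-y") under each title ("title x").
To associate them, I want to cut from each <h1> to the last <p> before the next <h1>, but I don't know how to do that.
I spent 5 hours searching, reading, and experimenting by trial and error, and finally wrote the code below, but it still doesn't work.
What's wrong? How can I cut the HTML?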
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("http://example.com/"))
doc.xpath('//div[@id="mw-content-text"]').each do |node|
  for i in 1..node.xpath('h1').length do
    mininode = node.xpath(%(node()[not(following-sibling::h1[#{i}] or preceding-sibling::h1[#{i+1}])]))
    title = mininode.xpath('h1/span').text
    puts title unless title.empty?
    puts "============"
    for j in 1..mininode.xpath('h2').length do
      puts mininode.xpath(%(h2[#{j}]/span)).text
      puts mininode.xpath(%(table[#{j}]/span)).text
    end
  end
end
Meditate on this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div id="hoge">
<h1><span>title 1</span></h1>
<h2><span>subtitle 1-1</span></h2>
<p></p>
<table class="fuga"><span>data 1-1</span></table>
<p></p>
//(the same structure repeated n times)
<h2><span>subtitle 1-(n+2)<span/></h2>
<p></p>
<table class="fuga"><span>data 1-(n+2)</span></table>
<p></p>
//(the same structure repeated m times)
<h1><span>title m</span></h1>
<h2><span>subtitle m-1</span></h2>
<p></p>
<table class="fuga"><span>data m-1</span></table>
<p></p>
//(the same structure repeated l times)
<h2><span>subtitle m-(l+2)</span></h2>
<p></p>
<table class="fuga"><span>data m-(l+2)</span></table>
<p></p>
</div>
EOT
Processing doc:
div = doc.at('#hoge')
h1_blocks = div.children.slice_before{ |node| node.name == 'h1' }.map{ |nodes| Nokogiri::XML::NodeSet.new(doc, nodes) }
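Running that leaves h1_blocks holding an array of NodeSets. Index 0 holds only the whitespace text node that precedes the first <h1>, so the first real group, based on your HTML, is at index 1: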
h1_blocks[1].map(&:to_html)
# => ["<h1><span>title 1</span></h1>",
# "\n\n ",
# "<h2><span>subtitle 1-1</span></h2>",
# "\n ",
# "<p></p>",
# "\n ",
# "<table class=\"fuga\"><span>data 1-1</span></table>",
# "\n ",
# "<p></p>",
# "\n\n //(the same structure repeated n times)\n\n ",
# "<h2><span>subtitle 1-(n+2)<span></span></span></h2>",
# "\n ",
# "<p></p>",
# "\n ",
# "<table class=\"fuga\"><span>data 1-(n+2)</span></table>",
# "\n ",
# "<p></p>",
# "\n\n\n //(the same structure repeated m times)\n\n "]
Here's the second group based on your HTML:
h1_blocks[2].map(&:to_html)
# => ["<h1><span>title m</span></h1>",
# "\n\n ",
# "<h2><span>subtitle m-1</span></h2>",
# "\n ",
# "<p></p>",
# "\n ",
# "<table class=\"fuga\"><span>data m-1</span></table>",
# "\n ",
# "<p></p>",
# "\n\n //(the same structure repeated l times)\n\n ",
# "<h2><span>subtitle m-(l+2)</span></h2>",
# "\n ",
# "<p></p>",
# "\n ",
# "<table class=\"fuga\"><span>data m-(l+2)</span></table>",
# "\n ",
# "<p></p>",
# "\n\n\n"]
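How does this work?
Ruby's Enumerable module has slice_before, which evaluates a test against each element and, every time the test returns true, starts a new sub-array at that element. That's useful when we have a flat list of elements and need to break it into separate chunks.
Typically we use it when parsing text that contains some sort of repeating blocks we need to process as chunks, such as paragraphs, network device interfaces, etc.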
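As a rough sketch of that kind of text chunking, here is slice_before applied to some made-up device output. The interface names and settings are invented for illustration and don't come from any real device:

```ruby
# Hypothetical device output, invented purely for illustration.
text = <<~TEXT
  interface eth0
    ip 10.0.0.1
    mtu 1500
  interface eth1
    ip 10.0.0.2
  interface lo
    ip 127.0.0.1
TEXT

# Start a new chunk at every line that begins with "interface".
blocks = text.lines(chomp: true)
             .slice_before { |line| line.start_with?('interface') }
             .to_a

# Each chunk is an array whose first element is the "interface" line.
blocks.each do |block|
  name, *settings = block
  puts "#{name}: #{settings.map(&:strip).join(', ')}"
end
# interface eth0: ip 10.0.0.1, mtu 1500
# interface eth1: ip 10.0.0.2
# interface lo: ip 127.0.0.1
```

Because the very first line matches the test, there is no leading chunk of unmatched lines here. When the input starts with something that doesn't match, that material ends up in a chunk of its own at index 0.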
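Once the child nodes of the <div id="hoge"> tag have been chunked, the chunks are passed to map, which turns them back into NodeSets, making it easy to continue processing them as you normally would in Nokogiri.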