How to get the HTML between a pair of identical tags using Nokogiri



I want to scrape an HTML file like this:

<div id="hoge">
  <h1><span>title 1</span></h1>
    <h2><span>subtitle 1-1</span></h2>
    <p></p>
    <table class="fuga"><span>data 1-1</span></table>
    <p></p>
    //(the same structure repeated n times)
    <h2><span>subtitle 1-(n+2)<span/></h2>
    <p></p>
    <table class="fuga"><span>data 1-(n+2)</span></table>
    <p></p>

  //(the same structure repeated m times)
  <h1><span>title m</span></h1>
    <h2><span>subtitle m-1</span></h2>
    <p></p>
    <table class="fuga"><span>data m-1</span></table>
    <p></p>
    //(the same structure repeated l times)
    <h2><span>subtitle m-(l+2)</span></h2>
    <p></p>
    <table class="fuga"><span>data m-(l+2)</span></table>
    <p></p>

</div>

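I need the table values (shown as data x-y in the example) for each subtitle ("subtitle x-y") under each title ("title x").
To keep them associated, I want to cut the HTML from each <h1> up to the last <p> before the next <h1>, but I don't know how to do that.
I spent 5 hours searching, reading, and going through trial and error, and finally wrote the code below, but it still doesn't work.
What's wrong with it? How can I cut the HTML?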

require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("http://example.com/"))
doc.xpath('//div[@id="mw-content-text"]').each do |node|
  for i in 1..node.xpath('h1').length do
    mininode = node.xpath(%(node()[not(following-sibling::h1[#{i}] or preceding-sibling::h1[#{i+1}])]))
    title = mininode.xpath('h1/span').text
    puts title unless title.empty?
    puts "============"
    for j in 1..mininode.xpath('h2').length do
      puts mininode.xpath(%(h2[#{j}]/span)).text
      puts mininode.xpath(%(table[#{j}]/span)).text
    end
  end
end

Meditate on this:

require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div id="hoge">
  <h1><span>title 1</span></h1>
    <h2><span>subtitle 1-1</span></h2>
    <p></p>
    <table class="fuga"><span>data 1-1</span></table>
    <p></p>
    //(the same structure repeated n times)
    <h2><span>subtitle 1-(n+2)<span/></h2>
    <p></p>
    <table class="fuga"><span>data 1-(n+2)</span></table>
    <p></p>

  //(the same structure repeated m times)
  <h1><span>title m</span></h1>
    <h2><span>subtitle m-1</span></h2>
    <p></p>
    <table class="fuga"><span>data m-1</span></table>
    <p></p>
    //(the same structure repeated l times)
    <h2><span>subtitle m-(l+2)</span></h2>
    <p></p>
    <table class="fuga"><span>data m-(l+2)</span></table>
    <p></p>

</div>
EOT

Processing doc:

div = doc.at('#hoge')
h1_blocks = div.children.slice_before{ |node| node.name == 'h1' }.map{ |nodes| Nokogiri::XML::NodeSet.new(doc, nodes) }

Running that results in h1_blocks containing an array of NodeSets. Here's the first group based on your HTML:

h1_blocks[1].map(&:to_html)
# => ["<h1><span>title 1</span></h1>",
#     "\n\n    ",
#     "<h2><span>subtitle 1-1</span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data 1-1</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n    //(the same structure repeated n times)\n\n    ",
#     "<h2><span>subtitle 1-(n+2)<span></span></span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data 1-(n+2)</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n\n  //(the same structure repeated m times)\n\n  "]

Here's the second group based on your HTML:

h1_blocks[2].map(&:to_html)
# => ["<h1><span>title m</span></h1>",
#     "\n\n    ",
#     "<h2><span>subtitle m-1</span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data m-1</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n    //(the same structure repeated l times)\n\n    ",
#     "<h2><span>subtitle m-(l+2)</span></h2>",
#     "\n    ",
#     "<p></p>",
#     "\n    ",
#     "<table class=\"fuga\"><span>data m-(l+2)</span></table>",
#     "\n    ",
#     "<p></p>",
#     "\n\n\n"]

How does this work?

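Ruby's Enumerable module has slice_before, which evaluates a block for each element and starts a new sub-array at every element for which the block returns a truthy value. That's useful when we have a flat array of elements and need to break it into separate chunks.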

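Typically we use it when parsing text that contains some sort of repeating blocks that need to be processed as chunks, such as paragraphs, network device interfaces, etc.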
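To see slice_before on its own, here is a minimal sketch using plain strings instead of HTML nodes. The config-style lines, interface names, and settings are invented purely for illustration:

```ruby
# Hypothetical config-style text; the interface names and settings are made up.
lines = [
  "# generated config",
  "interface eth0",
  "  ip 10.0.0.1",
  "  mtu 1500",
  "interface eth1",
  "  ip 10.0.0.2",
]

# slice_before starts a new chunk at every element for which the block is truthy.
# Elements that come before the first match end up in a leading chunk of their own.
blocks = lines.slice_before { |line| line.start_with?("interface") }.to_a

blocks.each do |block|
  header, *settings = block
  puts "#{header} -> #{settings.map(&:strip).inspect}"
end
```

Note the leading chunk: the comment line comes before the first "interface" line, so it forms blocks[0] by itself. The same thing happens with the HTML above, where the whitespace before the first <h1> becomes h1_blocks[0], which is why the groups shown start at h1_blocks[1].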

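Once the children of the <div id="hoge"> tag have been chunked this way, the chunks are passed to map, which turns them back into NodeSets, making it easy to keep processing them as you normally would in Nokogiri.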
