How to grab the HTML between two HTML comments using Nokogiri

I have some HTML pages where the content to be extracted is marked with HTML comments, as shown below.

<html>
 .....
<!-- begin content -->
 <div>some text</div>
 <div><p>Some more elements</p></div>
<!-- end content -->
...
</html>

I am using Nokogiri and trying to extract the HTML between the <!-- begin content --> and <!-- end content --> comments.

I want to extract the complete elements between these two HTML comments:

<div>some text</div>
<div><p>Some more elements</p></div>

I can get a plain-text version using the following characters callback:

require 'nokogiri'

class TextExtractor < Nokogiri::XML::SAX::Document
  def initialize
    @interesting = false
    @text = ""
    @html = ""
  end

  def comment(string)
    case string.strip        # strip leading and trailing whitespace
    when /^begin content/    # match starting comment
      @interesting = true
    when /^end content/      # match closing comment
      @interesting = false
    end
  end

  def characters(string)
    @text << string if @interesting
  end
end

That gives me the plain-text version in @text, but I need the full HTML stored in @html.


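Extracting the content between two nodes is not something we do often; usually we want the content inside a particular node. Comments are nodes too, just a special type of node.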

require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<body>
<!-- begin content -->
 <div>some text</div>
 <div><p>Some more elements</p></div>
<!-- end content -->
</body>
EOT

The starting node can be found by looking for a comment that contains the specified text:

start_comment = doc.at("//comment()[contains(.,'begin content')]") # => #<Nokogiri::XML::Comment:0x3fe94994268c " begin content ">

Once it is found, a loop stores the current node and then moves to the next sibling until it finds another comment:

content = Nokogiri::XML::NodeSet.new(doc)
contained_node = start_comment.next_sibling
loop do
  break if contained_node.nil? || contained_node.comment? # stop at the last sibling or at the next comment
  content << contained_node
  contained_node = contained_node.next_sibling
end
content.to_html # => "\n <div>some text</div>\n <div><p>Some more elements</p></div>\n"
