How to parse the HTML text that follows an element using Ruby

How can I parse and group this sample HTML with Ruby?

The page text:

<h2>heading one</h2>
<p>different content in here <a>test</a> <b>test</b></p>
<p>different content in here <a>test</a> <b>test</b></p>
<h2>heading two</h2>
<p>different content in here <a>test</a> <b>test</b></p>
<h2>heading three</h2>
<p>different content in here <a>test</a> <b>test</b></p>
<p>different content in here <a>test</a> <b>test</b></p>
<p>different content in here <a>test</a> <b>test</b></p>

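The elements are not nested, and I want to group them by heading. When I find an <h2>, I want to extract its text and, verbatim, everything that follows it until the next <h2>. The last heading has no further h2 to act as a delimiter.

Here is the sample output: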

- Heading one
"<p>different content in here <a>test</a> <b>test</b></p>
<p>different content in here <a>test</a> <b>test</b></p>"
- Heading 2
"<p>different content in here <a>test</a> <b>test</b></p>"

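You can do this very quickly with Nokogiri, without having to parse your HTML with regular expressions.

You can fetch the h2 elements and then extract the content that follows each of them.

There are some examples at https://www.rubyguides.com/2012/01/parsing-html-in-ruby/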

This should work.
Group 1 contains the heading text and group 2 contains the body.

Whitespace trimming is included.

/<h2\s*>\s*([\S\s]*?)\s*<\/h2\s*>\s*([\S\s]*?)(?=\s*<h2\s*>|\s*$)/

https://regex101.com/r/pgLIi0/1

Readable regex

<h2 \s* >
\s*
( [\S\s]*? )                  # (1) Heading
\s*
</h2 \s* >
\s*
( [\S\s]*? )                  # (2) Body
(?= \s* <h2 \s* > | \s* $ )
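A sketch of how this pattern might be applied from Ruby with String#scan. One adjustment is an assumption on my part rather than part of the pattern as posted: in Ruby, $ matches at the end of every line, so the lazy body group would stop at the first line break. The sketch therefore uses \z (end of string) in the lookahead. The sample input and the variable names are illustrative.

```ruby
# Sample input modeled on the question (contents are illustrative).
html = <<~HTML
  <h2>heading one</h2>
  <p>a <b>test</b></p>
  <p>b</p>
  <h2>heading two</h2>
  <p>c</p>
HTML

# The readable pattern in extended (x) mode, with \z in place of $.
pattern = %r{
  <h2 \s* >
  \s*
  ( [\S\s]*? )        # (1) heading
  \s*
  </h2 \s* >
  \s*
  ( [\S\s]*? )        # (2) body
  (?= \s* <h2 \s* > | \s* \z )
}x

# scan returns one [heading, body] pair per match.
sections = html.scan(pattern)

sections.each do |heading, body|
  puts "- #{heading}"
  puts body.inspect
end
```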

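I strongly recommend against trying to do it this way; "RegEx match open tags except XHTML self-contained tags" helps explain why. Patterns should only be used in the most trivial cases, where you own the code that generates the HTML. If you don't own the generator, any change in the HTML can break your code, often in ways that can't be repaired, especially during a critical outage late at night with your boss hounding you to get it running immediately.

Using Nokogiri, this will get you in the ballpark in a more robust and recommended way. This example only collects the h2 and the p nodes that follow it. Figuring out how to display them is left as an exercise.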

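require 'nokogiri'

html = <<EOT
<h2>heading 1</h2>
<p>content 1a<b>test</b></p>
<p>content 1b</p>
<h2>heading 2</h2>
<p>content 2a</p>
EOT

doc = Nokogiri::HTML.parse(html)

output = doc.search('h2').map { |h|
  next_node = h.next_sibling
  # A heading with nothing after it still gets an (empty) entry;
  # a bare `break` here would abort the whole map and return nil.
  next [h, []] unless next_node

  paragraphs = []
  loop do
    case
    when next_node.text? && next_node.blank?
      # Skip whitespace-only text nodes between tags.
    when next_node.name == 'p'
      paragraphs << next_node
    else
      # Any other node, including the next h2, ends this group.
      break
    end

    next_node = next_node.next_sibling
    break unless next_node
  end

  [h, paragraphs]
}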

This results in output containing an array of arrays that hold the nodes:

# => [[#(Element:0x3ff4e4034be8 {
#        name = "h2",
#        children = [ #(Text "heading 1")]
#        }),
#      [#(Element:0x3ff4e4034b98 {
#         name = "p",
#         children = [
#           #(Text "content 1a"),
#           #(Element:0x3ff4e3807ccc {
#             name = "b",
#             children = [ #(Text "test")]
#             })]
#         }),
#       #(Element:0x3ff4e4034ad0 {
#         name = "p",
#         children = [ #(Text "content 1b")]
#         })]],
#     [#(Element:0x3ff4e4034a6c {
#        name = "h2",
#        children = [ #(Text "heading 2")]
#        }),
#      [#(Element:0x3ff4e40349a4 {
#         name = "p",
#         children = [ #(Text "content 2a")]
#         })]]]

The code also makes some assumptions about the format of the HTML, but it won't spit out garbage if the format changes. It assumes a format like:

<h2>
<p>
...

where an h2 is always followed by p tags until some other tag appears, including a subsequent h2.

This test:

when next_node.text? && next_node.blank?

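is necessary because HTML doesn't require formatting, but when it is formatted, "TEXT" nodes containing only whitespace are inserted, and these produce the indentation we expect of "pretty HTML". Parsers and browsers don't care whether that whitespace is there, except inside preformatted text; only humans do. In fact, it's better not to have it, because it bloats the file and slows down its transfer. But people are picky that way. In reality, the HTML sample in the code looks more like: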

<h2>heading 1</h2>\n<p>content 1a<b>test</b></p>\n<p>content 1b</p>\n<h2>heading 2</h2>\n<p>content 2a</p>\n

and the when statement ignores those "\n" nodes.
