I have an HTML document with the following structure:
<li class="indent1">(something)
<li class="indent2">(something else)</li>
<li class="indent2">(something else)
<li class="indent3">(another sublevel)</li>
</li>
<li class="indent2">(something else)</li>
</li>
What I need to do is wrap these LI tags in OL tags. There are many such lists throughout the document. The HTML needs to look like this:
<ol>
<li>(something)
<ol>
<li>(something else)</li>
<li>(something else)
<ol>
<li>(another sublevel)</li>
</ol>
</li>
<li>(something else)</li>
</ol>
</li>
</ol>
How can I do this in Nokogiri? Thanks very much in advance.
Edit:
Here is a sample of the HTML from the original document. My script converts all P tags to LI tags.
<p class="indent1"><i>a.</i> This regulation describes the Army Planning, Programming,
Budgeting, and Execution System (PPBES). It explains how an integrated Secretariat and
Army Staff, with the full participation of major Army commands (MACOMs), Program
Executive Offices (PEOs), and other operating agencies--</p>
<p class="indent2">(1) Plan, program, budget, and then allocate and manage approved
resources.</p>
<p class="indent2">(2) Provide the commanders in chief (CINCs) of United States unified
and specified commands with the best mix of Army forces, equipment, and support
attainable within available resources.</p>
<p class="indent1"><i>b.</i> The regulation assigns responsibilities and describes
policy and procedures for using the PPBES to:</p>
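The indent1 class marks a first-level list item, indent2 a second-level list item, and so on. I need to convert these indent classes into proper ordered lists.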
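To make the intended mapping concrete, here is a minimal sketch of the level-tracking logic in plain Ruby, with no parser involved. It takes a flat list of [level, text] pairs (level being the number from the indentN class) and emits nested OL markup. The helper name `nest_items` is made up for illustration, and the sketch assumes levels start at 1 and never jump deeper by more than one step at a time.

```ruby
# Build nested <ol> markup from a flat list of [level, text] pairs.
# `nest_items` is a hypothetical helper, not part of Nokogiri or any library.
# Assumption: levels start at 1 and increase by at most 1 between items.
def nest_items(items)
  html = +''
  depth = 0 # number of <ol> elements currently open
  items.each do |level, text|
    raise ArgumentError, "bad level #{level}" if level < 1 || level > depth + 1

    if level > depth
      # One level deeper: open a new list inside the still-open <li>
      html << '<ol>'
    else
      # Same or shallower level: close the previous item, then close
      # one nested list (and its parent <li>) per level we step back
      html << '</li>'
      html << '</ol></li>' * (depth - level)
    end
    html << "<li>#{text}"
    depth = level
  end
  # Close whatever is still open at the end
  html << '</li>' unless items.empty?
  html << '</ol></li>' * [depth - 1, 0].max
  html << '</ol>' if depth > 0
  html
end

puts nest_items([[1, 'a.'], [2, '(1)'], [2, '(2)'], [1, 'b.']])
#=> <ol><li>a.<ol><li>(1)</li><li>(2)</li></ol></li><li>b.</li></ol>
```

For the sample above, the pairs could probably be collected with something like `doc.css('p').map { |p| [p['class'][/\d+/].to_i, p.inner_html] }`, assuming every P carries an indentN class.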
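The following solution works by walking through every <li> in the document and doing one of the following:
- If there is no preceding <ol>, replace the <li> with a new <ol> and then put the <li> inside it.
- If there is a preceding <ol>, move this <li> into it.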
document.css('li').each do |li|
  if li.at_xpath('preceding-sibling::node()[not(self::text()[not(normalize-space())])][1][self::ol]')
    li.previous_element << li
  else
    li.replace('<ol/>').first << li
  end
end
Here it is, tested:
require 'nokogiri'
# Use XML instead of HTML fragment due to problems with XPath
fragment = Nokogiri::XML.fragment '
<li>List 1
<li>List 1a</li>
<li>List 1b
<li>List 1bi</li>
</li>
<li>List 1c</li>
New List
<li>New List 1a</li>
</li>
<p>Break 1</p>
<li>List 2a</li>
<li>List 2b</li>
<p>Break 2</p>
<li>List 3 <li>List 3a</li></li>
'
fragment.css('li').each do |li|
  # Complex test to see if the preceding element is an <ol> and there's no non-empty text between the li and it
  # See http://stackoverflow.com/q/14045519/405017
  if li.at_xpath('preceding-sibling::node()[not(self::text()[not(normalize-space())])][1][self::ol]')
    li.previous_element << li
  else
    li.replace('<ol/>').first << li
  end
end
puts fragment # I've normalized the whitespace in the output to make it clear
#=> <ol>
#=> <li>List 1
#=> <ol>
#=> <li>List 1a</li>
#=> <li>List 1b
#=> <ol>
#=> <li>List 1bi</li>
#=> </ol>
#=> </li>
#=> <li>List 1c</li>
#=> </ol>
#=> New List
#=> <ol><li>New List 1a</li></ol>
#=> </li>
#=> </ol>
#=> <p>Break 1</p>
#=> <ol>
#=> <li>List 2a</li>
#=> <li>List 2b</li>
#=> </ol>
#=> <p>Break 2</p>
#=> <ol>
#=> <li>List 3
#=> <ol>
#=> <li>List 3a</li>
#=> </ol>
#=> </li>
#=> </ol>
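The problem is that your HTML is not well-formed (an <li> cannot be the direct child of another <li>). You won't be able to parse it successfully with Nokogiri.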