Nokogiri: irregular divs



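I'm trying to process irregular content inside div elements, specifically whatever follows an h3 heading. Nothing fixed appears under an h3 heading, but I need to associate any text with its heading. It might be a ul, a span, or both. The most important thing is not to merge all the text under the h3 headings together.

I've been able to navigate to my divs using the .css method. Each div contains one or more of four h3 headings, each followed by a note, or by a list if there are several notes.

How can I separate whatever follows an h3 tag, up to the next h3 tag (if there is one)?

Here is a sample of the divs I'm working with (I can already grab anything inside the h2, since it is structured the same way in every div):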

<div class="inspection_container">
  <h2 class="inspection_date_title">
    <div class="calendar_list">
      <span>Mar</span><strong>4</strong>
    </div>Routine Inspection<small>Inspected Mar. 4, 2014</small>
  </h2>
  <h3>Actions taken by inspector</h3>
  <ul>
    <li class="Comment">
      <strong>Consultation / Technical Assistance</strong><p>Instructions are given to the owner/operator to assist them with taking the proper actions to meet regulations.</p>
    </li>
  </ul>
</div>
<div class="inspection_container">
  <h2 class="inspection_date_title">
    <div class="calendar_list">
      <span>Sep</span><strong>4</strong>
    </div>Re-inspection<small>Inspected Sep. 4, 2013</small>
  </h2>
  <h3>Not in compliance</h3>
  <ul>
    <li class="X">
      <strong>Premise is clean/sanitary</strong><p>Food premise is to be maintained in a clean and sanitary condition.</p>
    </li>
  </ul>
  <h3>Actions taken by inspector</h3>
  <ul>
    <li class="Comment">
      <strong>Consultation / Technical Assistance</strong><p>Instructions are given to the owner/operator to assist them with taking the proper actions to meet regulations.</p>
    </li>
  </ul>
</div>
<div class="inspection_container">
  <h2 class="inspection_date_title">
    <div class="calendar_list">
      <span>Aug</span><strong>30</strong>
    </div>Routine Inspection<small>Inspected Aug. 30, 2013</small>
  </h2>
  <h3>Not in compliance</h3>
  <ul>
    <li class="X">
      <strong>Washrooms are cleaned regularly</strong><p>Washrooms are to be kept clean, sanitary, in good repair and must be supplied with liquid soap in a dispenser, single service/paper towels, cloth roller towel or hot air dryer and hot and cold running water.</p>
    </li>
    <li class="X">
      <strong>Building interior is well-maintained</strong><p>Walls, floors and ceilings are to be maintained and in good repair.</p>
    </li>
    <li class="X">
      <strong>Premise is clean/sanitary</strong><p>Food premise is to be maintained in a clean and sanitary condition.</p>
    </li>
  </ul>
  <h3>Actions taken by inspector</h3>
  <ul>
    <li class="Comment">
      <strong>Consultation / Technical Assistance</strong><p>Instructions are given to the owner/operator to assist them with taking the proper actions to meet regulations.</p>
    </li>
  </ul>
</div>

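Assuming that:

  • you only have interleaved h3 and ul elements up to the end of the wrapping div
  • no element other than ul can appear in the ul position in this structure
  • no element other than h3 can appear in the h3 position in this structure

and that your example is representative, this should do it: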

//ul[count(following-sibling::h3) = count(following-sibling::ul)]

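If other elements can appear in the same position as the ul, but there is always exactly one element between h3 headings, you can use this expression instead: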

//ul[count(following-sibling::h3) = count(following-sibling::*[not(local-name() = 'h3')])]

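As for directly grouping each h3 element with the ul element that follows it, I don't think that is feasible with XPath alone. You will need to do that in Ruby. I suggest searching for the div elements and walking through them imperatively, counting nodes as you go and pairing up the odd and even positions (the h3s and the uls).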
