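I'm trying to process irregular content inside a div element, specifically whatever follows each h3 heading. There is no fixed content under an h3, but I need to associate any text with its heading. It may be a ul, a span, or both. Most importantly, I don't want to merge all the text under the h3 headings together.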
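I have been able to navigate to my div elements using the .css method. Each div contains one or more of four possible h3 headings, each followed by a comment, or by a list if there are several comments.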
How can I separate out whatever follows an h3 tag, up to the next such tag (if there is one) or the end of the div?
Here is a sample of the div elements I'm working with (I can already grab anything inside the h2, since it has the same shape in every div):
<div class="inspection_container">
<h2 class="inspection_date_title">
<div class="calendar_list">
<span>Mar</span><strong>4</strong>
</div>Routine Inspection<small>Inspected Mar. 4, 2014</small>
</h2>
<h3>Actions taken by inspector</h3>
<ul>
<li class="Comment">
<strong>Consultation / Technical Assistance</strong><p>Instructions are given to the owner/operator to assist them with taking the proper actions to meet regulations.</p>
</li>
</ul>
</div>
<div class="inspection_container">
<h2 class="inspection_date_title">
<div class="calendar_list">
<span>Sep</span><strong>4</strong>
</div>Re-inspection<small>Inspected Sep. 4, 2013</small>
</h2>
<h3>Not in compliance</h3>
<ul>
<li class="X">
<strong>Premise is clean/sanitary</strong><p>Food premise is to be maintained in a clean and sanitary condition.</p>
</li>
</ul>
<h3>Actions taken by inspector</h3>
<ul>
<li class="Comment">
<strong>Consultation / Technical Assistance</strong><p>Instructions are given to the owner/operator to assist them with taking the proper actions to meet regulations.</p>
</li>
</ul>
</div>
<div class="inspection_container">
<h2 class="inspection_date_title">
<div class="calendar_list">
<span>Aug</span><strong>30</strong>
</div>Routine Inspection<small>Inspected Aug. 30, 2013</small>
</h2>
<h3>Not in compliance</h3>
<ul>
<li class="X">
<strong>Washrooms are cleaned regularly</strong><p>Washrooms are to be kept clean, sanitary, in good repair and must be supplied with liquid soap in a dispenser, single service/paper towels, cloth roller towel or hot air dryer and hot and cold running water.</p>
</li>
<li class="X">
<strong>Building interior is well-maintained</strong><p>Walls, floors and ceilings are to be maintained and in good repair.</p>
</li>
<li class="X">
<strong>Premise is clean/sanitary</strong><p>Food premise is to be maintained in a clean and sanitary condition.</p>
</li>
</ul>
<h3>Actions taken by inspector</h3>
<ul>
<li class="Comment">
<strong>Consultation / Technical Assistance</strong><p>Instructions are given to the owner/operator to assist them with taking the proper actions to meet regulations.</p>
</li>
</ul>
</div>
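Provided that:
- h3 and ul elements strictly alternate, all the way to the end of the wrapping div,
- no other element can appear in this structure in place of a ul,
- no other element can appear in this structure in place of an h3,
and your example is representative, this should work: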
//ul[count(following-sibling::h3) = count(following-sibling::ul)]
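If other elements can appear in the ul's position, but there is always exactly one element between consecutive h3 elements, you can use this expression instead: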
//ul[count(following-sibling::h3) = count(following-sibling::*[not(local-name() = 'h3')])]
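As for directly grouping each h3 element with the ul element that follows it, I don't think that is feasible with XPath alone. You will need to do it in Ruby. I suggest searching for the div elements and walking their children imperatively, counting nodes and pairing up the odd ones (the h3s) with the even ones (the uls).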
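The Ruby side of that suggestion could be sketched roughly as follows. Rather than pairing nodes strictly by odd/even position, this version starts a new group at every h3, so a heading followed by a span, a ul, both, or nothing should still end up with its own content. The helper name group_by_heading is made up for illustration, and REXML is used only so the sketch runs without extra gems on a well-formed snippet; the helper relies on nothing but each child's name, so if you are on Nokogiri (an assumption, given the .css call) passing div.element_children should work the same way.

```ruby
require "rexml/document"

# Hypothetical helper: splits a container's child elements into groups,
# starting a new group at every <h3>. Anything before the first <h3>
# (the <h2> here) forms a leading group that is discarded.
def group_by_heading(children)
  children
    .slice_before { |el| el.name == "h3" }
    .select { |group| group.first.name == "h3" }
    .map { |heading, *rest| [heading, rest] }
end

# Trimmed-down sample in the shape of the markup from the question.
html = <<~HTML
  <div class="inspection_container">
    <h2 class="inspection_date_title">Re-inspection</h2>
    <h3>Not in compliance</h3>
    <ul>
      <li class="X"><strong>Premise is clean/sanitary</strong><p>Food premise is to be maintained in a clean and sanitary condition.</p></li>
    </ul>
    <h3>Actions taken by inspector</h3>
    <ul>
      <li class="Comment"><strong>Consultation / Technical Assistance</strong><p>Instructions are given.</p></li>
    </ul>
  </div>
HTML

doc = REXML::Document.new(html)

# For each heading, collect the <strong> titles of the list items under it.
sections = group_by_heading(doc.root.elements.to_a).map do |heading, rest|
  titles = rest.flat_map { |el| REXML::XPath.match(el, "li/strong").map(&:text) }
  [heading.text, titles]
end

sections.each { |title, items| puts "#{title}: #{items.join('; ')}" }
```

Each entry in sections keeps one heading together with only the elements that follow it, so text from different h3 blocks is never merged.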