How do I exclude nested elements when scraping content with Nokogiri?

I have a page whose content looks like this:

<div id="level1">
    <div id="level2">
        <div id="level3">Crap i dont care about</div>
        Here is some text i want
        <br />
        Here is some more text i want
        <br />
        Oh i want this text too :)
    </div>
</div>

My goal is to capture the text in #level2, but the #level3 <div> is nested inside it, at the same level as the text I want.

Is it possible to exclude that <div> somehow? Should I modify the document and remove the element before parsing?

require 'nokogiri'
xml = <<-XML
<div id="level1">
    <div id="level2">
        <div id="level3">Crap i dont care about</div>
        Here is some text i want
        <br />
        Here is some more text i want
        <br />
        Oh i want this text too :)
    </div>
</div>
XML
page = Nokogiri::XML(xml)
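page.xpath("//*[@id='level3']").remove        # detach the unwanted nested <div> from the document
p page.xpath("//*[@id='level2']").inner_text  # text of what is left inside #level2
# => "\n        \n        Here is some text i want\n        \n        Here is some more text i want\n        \n        Oh i want this text too :)\n    "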

You can now clean up the output text if you like.
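
One possible cleanup, as a minimal sketch: split the string into lines, strip each one, and drop the blanks. The names raw and lines are illustrative only, and raw is simply a copy of the string printed above so the snippet runs on its own.

```ruby
# Copy of the string returned by inner_text above (assumed here for a standalone run).
raw = "\n        \n        Here is some text i want\n        \n        Here is some more text i want\n        \n        Oh i want this text too :)\n    "

# Split into lines, trim surrounding whitespace, and discard the empty ones.
lines = raw.lines.map(&:strip).reject(&:empty?)

puts lines
```

If you would rather have a single string, lines.join(" ") would give you one, though whether that suits you depends on what you do with the text afterwards.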

If your HTML fragment is in the variable html, you could do this:

doc = Nokogiri::HTML(html)
div = doc.at_css('#level2')   # Extract <div id="level2">
div.at_css('#level3').remove  # Remove <div id="level3">
text_you_want = div.inner_text

You could also use XPath, but I find CSS selectors simpler for a straightforward case like this.
