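How to make Nokogiri's inner_html ignore or remove escape sequences

I'm using Nokogiri to get the inner HTML of an element on a page. However, I get not only the element's text but also the whitespace escape sequences around it. Is there a way to suppress or remove them with Nokogiri?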

require 'nokogiri'
require 'open-uri'
# URI.open is required on Ruby 3.0+, where open-uri no longer patches Kernel#open
page = Nokogiri::HTML(URI.open("http://the.page.url.com"))
page.at_css("td[custom-attribute='foo']").parent.css('td').css('a').inner_html

Returns => "\r\n\t\t\t\t\t\t\t\tTheActuallyInnerContentThatIWant\r\n\t"

What is the most efficient and direct way to do this?

page.at_css("td[custom-attribute='foo']")
    .parent
    .css('td')
    .css('a')
    .text               # you need the text, not inner_html
    .strip              # removes leading and trailing whitespace from the result
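
As a minimal sketch of what strip does to the value shown in the question, the string below is copied from the question's output and the second string (messy) is a made-up input for illustration. strip only trims the ends; collapsing whitespace inside the text needs something like gsub.

```ruby
# The value returned in the question, with its escape sequences.
raw = "\r\n\t\t\t\t\t\t\t\tTheActuallyInnerContentThatIWant\r\n\t"

# String#strip removes leading and trailing whitespace, including \r, \n and \t.
stripped = raw.strip

# Hypothetical input with whitespace in the middle of the text:
# collapse every run of whitespace to a single space, then trim the ends.
messy = "  first\r\n\t\tsecond  "
collapsed = messy.gsub(/\s+/, ' ').strip

puts stripped
puts collapsed
```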

Side note: css('td a') is likely to be more efficient than css('td').css('a').

It's important to drill down to the node closest to the text you want. Consider this:

require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
  </body>
</html>
EOT
doc.at('body').inner_html # => "\n    <p>foo</p>\n  "
doc.at('body').text # => "\n    foo\n  "
doc.at('p').inner_html # => "foo"
doc.at('p').text # => "foo"

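at, at_css, and at_xpath return a single Node (a Nokogiri::XML::Element). search, css, and xpath return a NodeSet. text and inner_html return very different results depending on whether they are called on a Node or a NodeSet: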

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>
EOT
doc.at('p') # => #<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>
doc.search('p') # => [#<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>, #<Nokogiri::XML::Element:0x3fd635cf32bc name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf30dc "bar">]>]
doc.at('p').class # => Nokogiri::XML::Element
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.at('p').text # => "foo"
doc.search('p').text # => "foobar"

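Note that search returned a NodeSet, and text returned the text of all its nodes concatenated together. This is rarely what you want.

Also note that Nokogiri is smart enough to figure out whether a selector is CSS or XPath about 99% of the time, so the generic search and at methods are very convenient for either type of selector.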
