Nokogiri: grabbing text together with its formatting and link tags (<em>, <strong>, <a>, etc.)



How can I use Nokogiri to recursively capture all text together with its formatting tags?

<div id="1">
  This is text in the TD with <strong> strong </strong> tags
  <p>This is a child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of text to a <a href="link.html"> link </a>"
      <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
  </div>
</div>

For example, I want to capture:

"This is text in the TD with <strong> strong </strong> tags" 
"This is a child node. with <b> bold </b> tags"
"another line of text to a <a href="link.html"> link </a>"
"This is text inside a div <em>inside<em> another div inside a paragraph tag"

I can't just use .text(), because it strips the formatting tags, and I'm not sure how to do this recursively.

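Added detail: Sanitize looks like an interesting gem, and I'm reading about it now. However, I have some additional information that may clarify what I need to do.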

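I need to traverse each node, get the text, process it, and put it back. So I would grab the text "This is text in the TD with <strong> strong </strong> tags" and modify it to something like "This is modified text in the TD with <strong> strong </strong> tags". Then I would move to the next tag in div 1 and get the <p> text, "This is a child node. with <b> bold </b> tags", modify it to "This is a modified child node. with <b> bold </b> tags", and put it back. Next, go to div#2, grab the text "another line of text to a <a href="link.html"> link </a>", modify it to "another line of modified text to a <a href="link.html"> link </a>", and put it back. Finally, move to the next node in div#2 and grab the text from the paragraph tag, which becomes "This is modified text inside a div <em>inside<em> another div inside a paragraph tag".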

So after everything has been processed, the new HTML should look like this:

<div id="1">
  This is modified text in the TD with <strong> strong </strong> tags
  <p>This is a modified child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of modified text to a <a href="link.html"> link </a>"
      <p> This is modified text inside a div <em>inside<em> another div inside a paragraph tag</p>
  </div>
</div>

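My quasi-code is below, but I'm really stuck on two parts. First, grabbing only the text with its formatting: Sanitize helps here, but Sanitize grabs all of the tags. I need to preserve the formatting of the text, including whitespace, without grabbing unrelated child tags. Second, traversing all of the children that are directly related to full-text tags.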

# Quasi-code
doc = Nokogiri.HTML(html)
kids = doc.at('div#1')
text_kids = kids.descendant_elements
text_kids.each do |i|
  # Grab the full text (full sentences and paragraphs) with formatting tags.
  # Currently, I have no way to grab just the text with formatting and not the other tags.
  modified_text = processing_code(i.full_text_w_formating())
  i.full_text_w_formating = modified_text
end

def processing_code(string)
  # Code to process string (not relevant for this example)
  return modified_string
end

# Recursive 1
class Nokogiri::XML::Node
  def descendant_elements
    # This is flawed because it grabs every child and even
    # splits it based on any tag.
    # I need to traverse down only the text-related children.
    element_children.map { |kid|
      [kid, kid.descendant_elements]
    }.flatten
  end
end

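I would use a two-part strategy: Nokogiri to extract the content you want, then a blacklist/whitelist program to strip the tags you don't want or keep the ones you do.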

require 'nokogiri'
require 'sanitize'
html = '
<div id="1">
  This is text in the TD with <strong> strong <strong> tags
  <p>This is a child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of text to a <a href="link.html"> link </a>"
      <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
  </div>
</div>
'
doc = Nokogiri.HTML(html)
html_fragment = doc.at('div#1').to_html

This captures the contents of <div id="1"> as an HTML string:

      This is text in the TD with <strong> strong <strong> tags
      <p>This is a child node. with <b> bold </b> tags</p>
      <div id="2">
          "another line of text to a <a href="link.html"> link </a>"
          <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
      </div>
    </strong></strong>

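The trailing </strong></strong> is the result of the two opening <strong> tags. That may have been intentional, but since there are no closing tags, Nokogiri does some fix-up to make the HTML well-formed.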

Pass html_fragment to the Sanitize gem:

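# Keep only the inline formatting and link tags; everything else is stripped.
# Sanitize.clean is the call used here; newer Sanitize releases expose
# this operation as Sanitize.fragment.
doc = Sanitize.clean(
  html_fragment,
  :elements   => %w[ a b em strong ],
  :attributes => {
    'a'    => %w[ href ],
  },
)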

The returned text looks like this:

 This is text in the TD with <strong> strong <strong> tags
  This is a child node. with <b> bold </b> tags 
      "another line of text to a <a href="link.html"> link </a>"
        This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em> 
</strong></strong>

Again, because the HTML was malformed, with no closing </strong> tags, two trailing closing tags appear.
