Get the text of a paragraph with all tags (and their contents) removed

How can I get only the text of a node that has other tags inside it, such as:

<p>hello my website is <a href="www.website.com">click here</a> <b>test</b></p>

I only want " hello my website is "

This is what I have tried:

begin
  node = html_doc.css('p')
  node.each do |node|
    node.children.remove
  end
  return (node.nil?) ? ''  : node.text
rescue
  return ''
end

Update 2: OK, with node.children.remove you are removing all of the children, including the text nodes. A suggested solution might look like:

# 1. select all <p> nodes
doc.css('p').
  # 2. map children, and flatten
  map { |node| node.children }.flatten.
  # 3. select text nodes only
  select { |node| node.text? }.
  # 4. get text and join
  map { |node| node.text }.join(' ').strip

This example returns "hello my website is", but note that doc.css('p') also finds <p> tags nested inside other tags.

Update: Sorry, I misunderstood your question. You only want "hello my website is"; see the solution above. Original answer:

Not directly with Nokogiri, but another option is the sanitize gem: https://github.com/rgrove/sanitize/

Sanitize.clean(html, {}) # => " hello my website is click here test "

For reference, it uses Nokogiri internally.

Your test case does not include any interesting text interleaved with the markup.

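If you want to turn <p>Hello <b>World</b>!</p> into "Hello !", then removing the children is one way to do it. A simpler (and less destructive) way is to find all the text nodes and join them: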

require 'nokogiri'
html = Nokogiri::HTML('<p>Hello <b>World</b>!</p>')
# Find the first paragraph (in this case the only one)
para = html.at('p') 
# Find all the text nodes that are children (not descendants),
# change them from nodes into the strings of text they contain,
# and then smush the results together into one big string.
p para.xpath('text()').map(&:text).join
#=> "Hello !"

If you want to turn <p>Hello <b>World</b>!</p> into "Hello " (without the exclamation mark), then you can simply do:

p para.children.first.text # if you know that text is the first child
p para.at('text()').text   # if you want to find the first text node

As the earlier answer shows, you can use the String#strip method to remove leading/trailing whitespace from the result if you like.
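For instance, applied to the string the question asks for (a plain-Ruby sketch; the variable names are illustrative only):

```ruby
raw = ' hello my website is '

# String#strip returns a copy without leading/trailing whitespace;
# spaces inside the string are left untouched.
clean = raw.strip
p clean       #=> "hello my website is"

# rstrip trims only the trailing side.
p raw.rstrip  #=> " hello my website is"
```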

Here is a different approach. Rather than bothering to remove the nodes, remove the text those nodes contain:

require 'nokogiri'
doc = Nokogiri::HTML('<p>hello my website is <a href="www.website.com">click here</a> <b>test</b></p>')
text = doc.search('p').map{ |p|
  p_text = p.text
  a_text = p.at('a').text
  p_text[a_text] = ''
  p_text
}
puts text
# >> hello my website is  test

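This is a simple example, but the idea is to find the <p> tags, then scan inside them for the tags that contain text you don't want. For each unwanted tag, grab its text and delete it from the surrounding text.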

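In the sample code, at the point of the a_text assignment you would have a list of unwanted nodes, loop over them, and remove their text iteratively, like this: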

text = doc.search('p').map{ |p|
  p_text = p.text
  %w[a].each do |bad_nodes|
    bad_nodes_text = p.at(bad_nodes).text
    p_text[bad_nodes_text] = ''
  end
  p_text
}

You will get back text, which is an array of the adjusted text content of the <p> nodes.
