Nokogiri parsing HTML



I'm using Nokogiri to parse my HTML. The HTML looks like this:

<table>
  <tr>
    <td>
      <p>Important Preferences</p>
      To see as much as possible
      <br />Relaxation
      <br />Quality of accommodation
      <br />Quality of activities
      <br />Independence & flexibility
      <br />Safety & security
    </td>
    <td>
      <p>Budget Preferences</p>
      4000 to 5000 USD per person
      <br />5000 to 6000 USD per person
      <br />Above 6000 USD per person
    </td>
  </tr>
</table>

I'm trying to build a hash from it that looks like this:

{
  "Important Preferences" => "To see as much as possible, Relaxation, Quality of accommodation, Quality of activities, Independence & flexibility, Safety & security",
  "Budget Preferences" => "4000 to 5000 USD per person, 5000 to 6000 USD per person, Above 6000 USD per person"
}

I tried:

params = {}
Nokogiri::HTML("my HTML pls see above").css("td p").each do |item|
  params.merge!({item.text => item.next.text})
end

But I can't collect the values that follow the <br> tags.

The result is:

{
  "Important Preferences" => "To see as much as possible",
  "Budget Preferences" => "4000 to 5000 USD per person"
 }

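First, find all the <td> tags with xpath('//td'). Then, for each one, iterate over its children and collect a child's content only if the child is a Nokogiri::XML::Text (you don't want to collect the <br> tags, and the <p> is used only for the key):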

doc = Nokogiri::HTML.parse(html)
h = {}
doc.xpath('//td').each do |td|
  p = td.at_xpath('p')
  a = []
  td.children.each do |child|
    if Nokogiri::XML::Text === child
      t = child.text.strip
      a << t unless t.empty?
    end
  end
  h[p.text] = a.join(', ')
end
Result:

{"Important Preferences"=>"To see as much as possible, Relaxation, Quality of accommodation, Quality of activities, Independence & flexibility, Safety & security", 
 "Budget Preferences"=>"4000 to 5000 USD per person, 5000 to 6000 USD per person, Above 6000 USD per person"}

Or in a more compact form, without an explicit inner loop:

doc = Nokogiri::HTML.parse(html)
h = {}
doc.xpath('//td').each do |td|
  h[td.at_xpath('p').text] = td.children
    .select{|x| Nokogiri::XML::Text === x && !x.text.strip.empty?}
    .map{|x| x.text.strip}.join(', ')
end

You basically want to get all the siblings of each td p.

You can get the list of all the parent's children and remove the p from it:

item.parent.children.to_a - [item]

I'd do it like this:

require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<table>
  <tr>
    <td>
      <p>Important Preferences</p>
      To see as much as possible
      <br />Relaxation
      <br />Quality of accommodation
      <br />Quality of activities
      <br />Independence & flexibility
      <br />Safety & security
    </td>
    <td>
      <p>Budget Preferences</p>
      4000 to 5000 USD per person
      <br />5000 to 6000 USD per person
      <br />Above 6000 USD per person
    </td>
  </tr>
</table>
EOT
doc.search('td').map { |td|
  key = td.at('p').text
  [
    key,
    td.text.sub(/#{key}/, '').lstrip.gsub(/\n +/, ', ')
  ]
}.to_h 
# => {"Important Preferences"=>"To see as much as possible, Relaxation, Quality of accommodation, Quality of activities, Independence & flexibility, Safety & security, ", "Budget Preferences"=>"4000 to 5000 USD per person, 5000 to 6000 USD per person, Above 6000 USD per person, "}
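
The trailing ", " in that output comes from the newline and indentation before the closing </td>, which lstrip leaves in place. A small sketch on a plain string shows one possible adjustment: strip both ends before replacing the line breaks. The text variable is a hand-written stand-in for td.text, and passing the key to sub as a String (instead of interpolating it into a regex) is an extra precaution that is not part of the answer above:

```ruby
# Hand-written stand-in for td.text of the second cell (assumption).
text = "\n      Budget Preferences" \
       "\n      4000 to 5000 USD per person" \
       "\n      5000 to 6000 USD per person" \
       "\n    "
key = 'Budget Preferences'

# lstrip keeps the final "\n    ", which gsub then turns into ", ".
with_lstrip = text.sub(key, '').lstrip.gsub(/\n +/, ', ')

# strip removes whitespace at both ends, so no trailing separator remains.
with_strip = text.sub(key, '').strip.gsub(/\n +/, ', ')

puts with_lstrip.inspect
puts with_strip.inspect
```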

If you're using an older version of Ruby that doesn't have to_h, use:

Hash[
  doc.search('td').map { |td|
    key = td.at('p').text
    [
      key,
      td.text.sub(/#{key}/, '').lstrip.gsub(/\n +/, ', ')
    ]
  }
]
# => {"Important Preferences"=>"To see as much as possible, Relaxation, Quality of accommodation, Quality of activities, Independence & flexibility, Safety & security, ", "Budget Preferences"=>"4000 to 5000 USD per person, 5000 to 6000 USD per person, Above 6000 USD per person, "}
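
Both forms build the same hash from an array of [key, value] pairs; as far as I know, Array#to_h arrived in Ruby 2.1, while Hash[] works on older versions too. A quick sketch with made-up pairs (the values are illustrative only):

```ruby
# Illustrative [key, value] pairs, as produced by the map block above.
pairs = [
  ['Important Preferences', 'Relaxation'],
  ['Budget Preferences', 'Above 6000 USD per person']
]

from_to_h = pairs.to_h        # newer Rubies
from_brackets = Hash[pairs]   # also available on older Rubies

puts from_to_h == from_brackets
```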
