I'm using Nokogiri to parse my HTML. The HTML looks like this:
<table>
<tr>
<td>
<p>Important Preferences</p>
To see as much as possible
<br />Relaxation
<br />Quality of accommodation
<br />Quality of activities
<br />Independence & flexibility
<br />Safety & security
</td>
<td>
<p>Budget Preferences</p>
4000 to 5000 USD per person
<br />5000 to 6000 USD per person
<br />Above 6000 USD per person
</td>
</tr>
</table>
I'm trying to build a hash from it that looks like this:
{
"Important Preferences" => "To see as much as possible, Relaxation, Quality of accommodation, Quality of activities, Independence & flexibility, Safety & security",
"Budget Preferences" => "4000 to 5000 USD per person, 5000 to 6000 USD per person, Above 6000 USD per person"
}
I tried:
params = {}
Nokogiri::HTML("my HTML pls see above").css("td p").each do |item|
  params.merge!({item.text => item.next.text})
end
But I can't collect the values that come after the <br> tags.
The result is:
{
"Important Preferences" => "To see as much as possible",
"Budget Preferences" => "4000 to 5000 USD per person"
}
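First, find all the <td> tags with xpath('//td'). Then, for each one, iterate over its children and collect the content of every child that is a Nokogiri::XML::Text (you don't want to collect the <br> tags):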
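require 'nokogiri'

# html is assumed to hold the markup from the question
doc = Nokogiri::HTML.parse(html)
h = {}
doc.xpath('//td').each do |td|
  p = td.at_xpath('p')   # the <p> supplies the hash key
  a = []
  td.children.each do |child|
    # keep text nodes only; this skips <p> and <br>
    if Nokogiri::XML::Text === child
      t = child.text.strip
      a << t unless t.empty?
    end
  end
  h[p.text] = a.join(', ')
end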
Result:
{"Important Preferences"=>"To see as much as possible, Relaxation, Quality of accommodation, Quality of activities, Independence & flexibility, Safety & security",
"Budget Preferences"=>"4000 to 5000 USD per person, 5000 to 6000 USD per person, Above 6000 USD per person"}
Or, in a more compact form that avoids the explicit inner loop:
doc = Nokogiri::HTML.parse(html)
h = {}
doc.xpath('//td').each do |td|
  h[td.at_xpath('p').text] = td.children
    .select { |x| Nokogiri::XML::Text === x && !x.text.strip.empty? }
    .map { |x| x.text.strip }.join(', ')
end
You basically want all the siblings of the td p. You can get the list of all the children of its parent and remove the p itself:
item.parent.children.to_a - [item]
I'd do it like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<table>
<tr>
<td>
<p>Important Preferences</p>
To see as much as possible
<br />Relaxation
<br />Quality of accommodation
<br />Quality of activities
<br />Independence & flexibility
<br />Safety & security
</td>
<td>
<p>Budget Preferences</p>
4000 to 5000 USD per person
<br />5000 to 6000 USD per person
<br />Above 6000 USD per person
</td>
</tr>
</table>
EOT
doc.search('td').map { |td|
  key = td.at('p').text
  [
    key,
    # turn each newline (plus any indentation after it) into ', '
    td.text.sub(/#{key}/, '').lstrip.gsub(/\n */, ', ')
  ]
}.to_h
# => {"Important Preferences"=>"To see as much as possible, Relaxation, Quality of accommodation, Quality of activities, Independence & flexibility, Safety & security, ", "Budget Preferences"=>"4000 to 5000 USD per person, 5000 to 6000 USD per person, Above 6000 USD per person, "}
If you're on an older version of Ruby that doesn't have to_h, use:
Hash[
  doc.search('td').map { |td|
    key = td.at('p').text
    [
      key,
      # turn each newline (plus any indentation after it) into ', '
      td.text.sub(/#{key}/, '').lstrip.gsub(/\n */, ', ')
    ]
  }
]
# => {"Important Preferences"=>"To see as much as possible, Relaxation, Quality of accommodation, Quality of activities, Independence & flexibility, Safety & security, ", "Budget Preferences"=>"4000 to 5000 USD per person, 5000 to 6000 USD per person, Above 6000 USD per person, "}