I have an HTML table that I want to parse. I want to walk down each <tr> and extract its href. The HTML looks like this:
<table id="classified_table" class="vs-classified-table widget-off top" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td id="classified_cell">
<table class="vs-classified-table widget-off" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr id="vs_classified_73634384" class="classified row1 kiwii-clad-row kiwii-clad-featured">
<tr id="vs_classified_74530668" class="classified row2 kiwii-clad-row kiwii-clad-featured">
<tr id="vs_classified_62296263" class="classified row3 kiwii-clad-row kiwii-clad-featured">
<tr id="vs_classified_62468547" class="classified row4 kiwii-clad-row kiwii-clad-featured">
<tr id="vs_classified_47122034" class="classified row5 kiwii-clad-row kiwii-clad-featured">
<tr id="vs_classified_78210646" class="classified row6 kiwii-clad-row">
<tr id="vs_classified_78207083" class="classified row7 kiwii-clad-row">
<tr id="vs_classified_69104369" class="classified row8 kiwii-clad-row">
<tr id="vs_classified_78113204" class="classified row9 kiwii-clad-row">
<tr id="vs_classified_52761813" class="classified row10 kiwii-clad-row">
<tr id="vs_classified_78121746" class="classified row11 kiwii-clad-row">
<tr id="vs_classified_76515548" class="classified row12 kiwii-clad-row">
<tr id="vs_advert_middle" class="vs-advertisement advertisment-middle-2 vs-adsense-middle-BR-" style="border:none">
<tr id="vs_classified_34048811" class="classified row13 kiwii-clad-row">
My Ruby code is as follows:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(open('http://servico-informatica.vivanuncios.com/computador+rio-de-janeiro-capital/'))
rows = page.css('tr#vs_classified_73634384.classified td.summary div a#vs-detail-link-1.kiwii-clear-none')
puts rows.text
#this works
rows [1..10].each do |row|
puts "this isn't working :("
end
The first puts correctly prints the text of the first <tr>, but the puts inside the each loop never runs.
The page I am trying to scrape is: http://servico-informatica.vivanuncios.com/computador+rio-de-janeiro-capital/
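You only get one result because your CSS query uses #, the ID selector, so it looks for an element that is unique on the page (spec). You therefore need to change the query so that it finds the hrefs by CSS class instead: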
tr.classified td.summary a.classified-link
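Update
The CSS path above fetches all of the links; after that you only need to iterate over the result and do whatever you need with each href and its text.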
require 'rubygems'
require 'nokogiri'
require 'open-uri'
# On Ruby 3.0+ open-uri requires URI.open; a bare open no longer fetches URLs
page = Nokogiri::HTML(URI.open('http://servico-informatica.vivanuncios.com/computador+rio-de-janeiro-capital/'))
links = page.css("tr.classified td.summary a.classified-link")
links.each do |link|
puts link['href']
puts link.content
end
I'm not sure what you expect this to do:
rows [1..10].each do |row|
puts "this isn't working :("
end
but I'm fairly sure it doesn't do what you expect. It is actually parsed as:
rows[1..10].each { ... }
Since rows (which is a Nokogiri::XML::NodeSet) has only one entry, asking for the subset that starts at index 1 gives you an empty NodeSet, which means you are really just saying:
some_empty_node_set.each { ... }
and that does nothing useful. However, if you look at the first entry in rows, you will find the href you are after:
rows[0]['href']
# "http://servico-informatica.vivanuncios.com/..."
You could also use rows.attr('href') or rows.first['href'], depending on your taste and what suits your needs.