Ruby on Rails - How do I use Nokogiri to get the "ASIN" tags from a list?



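I'm trying to use Nokogiri to get the ASIN numbers from an Amazon HTML page, but I'm not having any luck with XPath. I've also tried FirePath and I still get nothing. Would it be better to just grab the URLs and then run a Ruby regex over them to extract the ASIN? If so, what would the regex look like?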

#!/usr/bin/env ruby -w
require 'nokogiri'
require 'open-uri'
url = "http://www.amazon.com/gp/new-releases/books/3839/ref=zg_bsnr_nav"
doc = Nokogiri::HTML(open(url))
puts doc.xpath('//zg_list').each do | node|
  p node['asin']
end

Here is my code that prints out the URLs, with a sample result:

#!/usr/bin/env ruby -w
require 'nokogiri'
require 'open-uri'
url = "http://www.amazon.com/gp/new-releases/books/3839/ref=zg_bsnr_nav"
doc = Nokogiri::HTML(open(url))
l = doc.css('div.zg_image a').map { |link| 
  link['href'] 
  }
puts l # => /Introducing-ZBrush-4-Eric-Keller/dp/0470527641/ref=zg_bsnr_3839_20/183-0702383-0095048

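For me, the css method in Nokogiri is much easier to use than XPath. Given the HTML at the URL you posted, the following code should retrieve the "asin" attribute for each entry: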

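doc.css("div.zg_item").map { |e| e["asin"] }

I think the correct XPath would look something like this (it returns the matching nodes, so read each node's "asin" attribute as above):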
doc.xpath("//div[contains(@class, 'zg_item') and @asin]")

You can use either a CSS accessor or XPath:

#!/usr/bin/env ruby -w
require 'nokogiri'
require 'open-uri'
url = "http://www.amazon.com/gp/new-releases/books/3839/ref=zg_bsnr_nav"
# URI.open is the open-uri entry point; Kernel#open no longer accepts URLs as of Ruby 3.0
doc = Nokogiri::HTML(URI.open(url))
# CSS
# doc.search('div[class="zg_item zg_sparseListItem"]').each { |n| p n['asin'] }
# XPath
doc.search('//div[@class="zg_item zg_sparseListItem"]').each { |n| p n['asin'] }
# >> "1934356549"
# >> "0596802471"
# >> "B004M8T01Q"
# >> "0596809158"
# >> "0470943327"
# >> "B004MMEJ36"
# >> "1935182641"
# >> "B004RDOPJI"
# >> "1449390501"
# >> "1449389716"
# >> "B004IWRH4I"
# >> "0470527641"
# >> "0735650926"
# >> "1430231475"
# >> "0321751043"
# >> "B004NBZ65G"
# >> "B004TMNSJK"
# >> "0132091518"
# >> "144030842X"
# >> "1430234040"
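
As for the regex route the question asks about, here is a rough sketch. It assumes the ASIN is always the 10-character run of uppercase letters and digits that follows /dp/ in the link, which holds for the sample URL and the values listed above but is not a documented guarantee. ASIN_PATTERN and asin_from_url are hypothetical names introduced only for this example.

```ruby
# Assumption: an ASIN is exactly 10 uppercase letters/digits right after "/dp/".
ASIN_PATTERN = %r{/dp/([A-Z0-9]{10})(?:[/?]|\z)}

# Hypothetical helper: returns the ASIN string, or nil when the URL has none.
def asin_from_url(url)
  match = ASIN_PATTERN.match(url)
  match && match[1]
end

# Sample href taken from the question.
href = "/Introducing-ZBrush-4-Eric-Keller/dp/0470527641/ref=zg_bsnr_3839_20/183-0702383-0095048"
p asin_from_url(href) # prints "0470527641"
```

Reading the "asin" attribute with Nokogiri is probably the more robust choice while the page exposes it, since it does not depend on the URL layout; the regex is a fallback for when only the links are available.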
