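This is what I have so far... The problem is that it generates a JSON file that looks like the output below. When I inspect the markup on the page, I don't see anything unique about the CSS selectors for each column. They are all just plain text. Any tips would be greatly appreciated.
Thanks!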
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
require 'json'

# URI.open is used here because Kernel#open no longer opens URLs on Ruby 3.0+.
sammiches = Nokogiri::HTML(URI.open("http://en.wikipedia.org/wiki/List_of_sandwiches"))

class Scraper
  def initialize
    @url = "http://en.wikipedia.org/wiki/List_of_sandwiches"
    @nodes = Nokogiri::HTML(URI.open(@url))
  end

  def summary(filename)
    sammich_data = @nodes
    sammiches = sammich_data.css('div.mw-content-ltr table.wikitable tr')
    sammich_hashes = sammiches.map {|x|
      name = x.css('td a').text
      image = x.css('td a.image').text
      country = x.css('td a').text
      description = x.css('td a').text
      {
        :name => name,
        :image => image,
        :country => country,
        :description => description,
      }
    }
    File.open("public/#{filename}","w") do |f|
      f.write(JSON.pretty_generate(sammich_hashes))
    end
  end
end

sammy = Scraper.new
puts sammy.summary('listy')
Part of the JSON file output:
[
  {
    "name": "",
    "image": "",
    "country": "",
    "description": ""
  },
  {
    "name": "BaconUnited Kingdomketchupbrown sauce",
    "image": "",
    "country": "BaconUnited Kingdomketchupbrown sauce",
    "description": "BaconUnited Kingdomketchupbrown sauce"
  },
  {
    "name": "Bacon, egg and cheesebreakfast sandwich",
    "image": "",
    "country": "Bacon, egg and cheesebreakfast sandwich",
    "description": "Bacon, egg and cheesebreakfast sandwich"
Use the td index to pick each column:
name = x.at('td[1]').text
country = x.at('td[3]').text
You may want to remove the footnote references first:
sammich_data.search('sup').remove
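Rather than parsing Wikipedia's HTML, consider using their API, which will give you the data as XML, JSON, or other formats. It's cleaner and more reusable.
You can even have it return the page rendered as HTML, without all the surrounding borders and boxes.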
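As a rough sketch of the API route, the snippet below builds a request for the MediaWiki action=parse module and pulls the rendered HTML out of the JSON response. The helper names (parse_api_uri, extract_html, fetch_rendered_html) are hypothetical, and the response shape assumed here (the HTML as a string under parse.text when formatversion=2 is requested) should be checked against the API documentation before relying on it.

```ruby
require 'json'
require 'net/http'
require 'uri'

API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'

# Build the request URI for the rendered HTML of one page.
def parse_api_uri(page)
  uri = URI(API_ENDPOINT)
  uri.query = URI.encode_www_form(
    action: 'parse',
    page: page,
    prop: 'text',
    format: 'json',
    formatversion: 2
  )
  uri
end

# Extract the HTML string from a response body.
# Assumes the formatversion=2 shape: {"parse": {"text": "<html...>"}}.
def extract_html(body)
  JSON.parse(body).fetch('parse').fetch('text')
end

# Performs the actual network request; not called here so the sketch runs offline.
def fetch_rendered_html(page)
  extract_html(Net::HTTP.get(parse_api_uri(page)))
end
```

The string returned by fetch_rendered_html('List_of_sandwiches') could then be passed to Nokogiri::HTML and queried the same way as before, without the site navigation around the article. Wikimedia asks API clients to send a descriptive User-Agent header, so a real script would likely need to set one.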