Nokogiri: how do I parse this Wikipedia table if its TR elements have no unique names?



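Here is what I have so far. The problem is that it generates a JSON file that looks like the output below. When I inspect the page source, I don't see anything unique that a CSS selector could hook onto; the rows are all plain tr elements. Any tips would be appreciated.

Thanks!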

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
require 'json'

class Scraper
  def initialize
    @url = "http://en.wikipedia.org/wiki/List_of_sandwiches"
    # URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3+
    @nodes = Nokogiri::HTML(URI.open(@url))
  end

  def summary(filename)
    sammich_data = @nodes
    sammiches = sammich_data.css('div.mw-content-ltr table.wikitable tr')

    sammich_hashes = sammiches.map { |x|
      name = x.css('td a').text
      image = x.css('td a.image').text
      country = x.css('td a').text
      description = x.css('td a').text
      {
        :name => name,
        :image => image,
        :country => country,
        :description => description,
      }
    }

    File.open("public/#{filename}", "w") do |f|
      f.write(JSON.pretty_generate(sammich_hashes))
    end
  end
end

sammy = Scraper.new
puts sammy.summary('listy')

Part of the JSON output:

[
{
"name": "",
"image": "",
"country": "",
"description": ""
},
{
"name": "BaconUnited Kingdomketchupbrown sauce",
"image": "",
"country": "BaconUnited Kingdomketchupbrown sauce",
"description": "BaconUnited Kingdomketchupbrown sauce"
},
{
"name": "Bacon, egg and cheesebreakfast sandwich",
"image": "",
"country": "Bacon, egg and cheesebreakfast sandwich",
"description": "Bacon, egg and cheesebreakfast sandwich"

Use the td index:

name = x.at('td[1]').text
country = x.at('td[3]').text

You may want to remove the footnote references first:

sammich_data.search('sup').remove

Rather than parsing Wikipedia's HTML, consider using their API, which will give you the data as XML, JSON, or other formats. It is cleaner and more reusable.

You can even have it return the page rendered as HTML, without all the surrounding borders and boxes.
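
A rough sketch of that approach, assuming the MediaWiki `parse` action with `prop=text` and `formatversion=2`, where the rendered HTML is expected under `parse.text` in the JSON response. The helper names `parse_api_uri` and `html_from_response` are hypothetical, and the block only builds the request URL and unpacks a response body; it does not perform the HTTP call itself.

```ruby
require 'json'
require 'uri'

API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'

# Build a URL asking the "parse" action for a page's rendered HTML as JSON.
def parse_api_uri(page_title)
  uri = URI(API_ENDPOINT)
  uri.query = URI.encode_www_form(
    action: 'parse', page: page_title, prop: 'text', format: 'json', formatversion: 2
  )
  uri
end

# Extract the HTML string from a response body.
# Assumes the formatversion=2 shape: { "parse": { "text": "<html...>" } }.
def html_from_response(body)
  JSON.parse(body).dig('parse', 'text')
end

puts parse_api_uri('List_of_sandwiches')
```

One possible way to use it would be `html_from_response(Net::HTTP.get(parse_api_uri('List_of_sandwiches')))` after `require 'net/http'`, then passing the returned string to `Nokogiri::HTML` and applying the same td-index extraction to it.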
