I want to parse a very large file (240 MB), and I have to use SAX to avoid loading the whole file into memory.
My XML looks like this:
<?xml version="1.0" encoding="utf-8"?>
<hotels>
<hotel>
<hotelId>1568054</hotelId>
<hotelFileName>Der_Obere_Wirt_zum_Queri</hotelFileName>
<hotelName>"Der Obere Wirt" zum Queri</hotelName>
<rating>3</rating>
<cityId>34633</cityId>
<cityFileName>Andechs</cityFileName>
<cityName>Andechs</cityName>
<stateId>212</stateId>
<stateFileName>Bavaria</stateFileName>
<stateName>Bavaria</stateName>
<countryCode>DE</countryCode>
<countryFileName>Germany</countryFileName>
<countryName>Germany</countryName>
<imageId>51498149</imageId>
<Address>Georg Queri Ring 9</Address>
<minRate>85.9800</minRate>
<currencyCode>EUR</currencyCode>
<Latitude>48.009423000000</Latitude>
<Longitude>11.214504000000</Longitude>
<NumberOfReviews>16</NumberOfReviews>
<ConsumerRating>4.25</ConsumerRating>
<PropertyType>0</PropertyType>
<ChainID>0</ChainID>
<Facilities>1|3|5|8|22|27|45|49|53|56|64|66|67|139|202|209|213|256|</Facilities>
</hotel>
<hotel>
<hotelId>1658359</hotelId>
<hotelFileName>Seclusions_of_Yallingup</hotelFileName>
<hotelName>"Seclusions" of Yallingup</hotelName>
<rating>4</rating>
<cityId>72257</cityId>
<cityFileName>Yallingup</cityFileName>
<cityName>Yallingup</cityName>
<stateId>172</stateId>
<stateFileName>Western_Australia</stateFileName>
<stateName>Western Australia</stateName>
<countryCode>AU</countryCode>
<countryFileName>Australia</countryFileName>
<countryName>Australia</countryName>
<imageId>53234107</imageId>
<Address>58 Zamia Grove</Address>
<minRate>218.1825</minRate>
<currencyCode>AUD</currencyCode>
<Latitude>-33.691192000000</Latitude>
<Longitude>115.061938999999</Longitude>
<NumberOfReviews>0</NumberOfReviews>
<ConsumerRating>0</ConsumerRating>
<PropertyType>3</PropertyType>
<ChainID>0</ChainID>
<Facilities>3|6|13|14|21|22|28|39|40|41|51|53|54|56|57|58|65|66|141|191|202|204|209|210|211|292|</Facilities>
</hotel>
<hotel>
<hotelId>1491947</hotelId>
<hotelFileName>1_Melrose_Blvd</hotelFileName>
<hotelName>#1 Melrose Blvd</hotelName>
<rating>5</rating>
<cityId>964</cityId>
<cityFileName>Johannesburg</cityFileName>
<cityName>Johannesburg</cityName>
<stateId/>
<stateFileName/>
<stateName/>
<countryCode>ZA</countryCode>
<countryFileName>South_Africa</countryFileName>
<countryName>South Africa</countryName>
<imageId>46777171</imageId>
<Address>1 Melrose Boulevard Melrose Arch</Address>
<minRate/>
<currencyCode>ZAR</currencyCode>
<Latitude>-26.135656000000</Latitude>
<Longitude>28.067751000000</Longitude>
<NumberOfReviews>0</NumberOfReviews>
<ConsumerRating>0</ConsumerRating>
<PropertyType>9</PropertyType>
<ChainID>0</ChainID>
<Facilities>6|7|9|11|12|15|17|18|21|32|34|39|41|42|50|51|56|58|60|140|173|202|293|296|</Facilities>
</hotel>
<hotel>
<hotelId>1726938</hotelId>
<hotelFileName>1_Value_Inn_Clovis</hotelFileName>
<hotelName>#1 Value Inn Clovis</hotelName>
<rating>2</rating>
<cityId>28538</cityId>
<cityFileName>Clovis_New_Mexico</cityFileName>
<cityName>Clovis (New Mexico)</cityName>
<stateId>32</stateId>
<stateFileName>New_Mexico</stateFileName>
<stateName>New Mexico</stateName>
<countryCode>US</countryCode>
<countryFileName>United_States</countryFileName>
<countryName>United States</countryName>
<imageId/>
<Address>1720 Mabry</Address>
<minRate/>
<currencyCode>USD</currencyCode>
<Latitude>34.396549224853</Latitude>
<Longitude>-103.182769775390</Longitude>
<NumberOfReviews>0</NumberOfReviews>
<ConsumerRating>0</ConsumerRating>
<PropertyType>2</PropertyType>
<ChainID>0</ChainID>
<Facilities>6|7|8|18|21|22|27|41|50|52|56|222|281|292|</Facilities>
</hotel>
</hotels>
I tried this code:
class Wikihandler < Nokogiri::XML::SAX::Document
  def initialize
    # do one-time setup here, called as part of Class.new
  end

  def start_element(name, attributes = [])
    # check the element name here and create an active record object if appropriate
    if name == 'hotel'
      a = Hash[*attributes]
      puts attributes
      # more business...
    end
  end

  def characters(s)
    # save the characters that appear here and possibly use them in the current tag object
  end

  def end_element(name)
    # check the tag name and possibly use the characters you've collected
    # and save your activerecord object now
  end
end

parser = Nokogiri::XML::SAX::Parser.new(Wikihandler.new)
parser.parse_file('HotelCombinedXml/Hotels_All.xml')
I can get at the tags themselves, but how do I access their content?
Wikihandler#characters will give you the content. You could do something like this:
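class MyDocument < Nokogiri::XML::SAX::Document
  attr_accessor :is_name

  def initialize
    @is_name = false
  end

  def end_document
    puts "the document has ended"
  end

  def start_element(name, attributes = [])
    # Remember whether we are currently inside a <hotelName> element.
    @is_name = name.eql?("hotelName")
  end

  def end_element(name)
    # Leaving any element ends the name text.
    @is_name = false
  end

  def characters(string)
    # Strip into a copy rather than mutating the parser's string.
    # Note: the parser may deliver one text node in several calls.
    text = string.strip
    puts "Name: #{text}" if @is_name && !text.empty?
  end
end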
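However, if you want to make your life easier, I suggest you take a look at SAX Machine (the sax-machine gem). It adds some nice features and a friendlier interface (IMHO) on top of Nokogiri's SAX parser. Here is some sample code with specs: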
require "sax-machine"
require "rspec"
XML = <<XML
<?xml version="1.0" encoding="utf-8"?>
<hotels>
<hotel>
<hotelId>1568054</hotelId>
<hotelFileName>Der_Obere_Wirt_zum_Queri</hotelFileName>
<hotelName>"Der Obere Wirt" zum Queri</hotelName>
<rating>3</rating>
<cityId>34633</cityId>
<cityFileName>Andechs</cityFileName>
<cityName>Andechs</cityName>
<stateId>212</stateId>
<stateFileName>Bavaria</stateFileName>
<stateName>Bavaria</stateName>
<countryCode>DE</countryCode>
<countryFileName>Germany</countryFileName>
<countryName>Germany</countryName>
<imageId>51498149</imageId>
<Address>Georg Queri Ring 9</Address>
<minRate>85.9800</minRate>
<currencyCode>EUR</currencyCode>
<Latitude>48.009423000000</Latitude>
<Longitude>11.214504000000</Longitude>
<NumberOfReviews>16</NumberOfReviews>
<ConsumerRating>4.25</ConsumerRating>
<PropertyType>0</PropertyType>
<ChainID>0</ChainID>
<Facilities>1|3|5|8|22|27|45|49|53|56|64|66|67|139|202|209|213|256|</Facilities>
</hotel>
<hotel>
<hotelId>1658359</hotelId>
<hotelFileName>Seclusions_of_Yallingup</hotelFileName>
<hotelName>"Seclusions" of Yallingup</hotelName>
<rating>4</rating>
<cityId>72257</cityId>
<cityFileName>Yallingup</cityFileName>
<cityName>Yallingup</cityName>
<stateId>172</stateId>
<stateFileName>Western_Australia</stateFileName>
<stateName>Western Australia</stateName>
<countryCode>AU</countryCode>
<countryFileName>Australia</countryFileName>
<countryName>Australia</countryName>
<imageId>53234107</imageId>
<Address>58 Zamia Grove</Address>
<minRate>218.1825</minRate>
<currencyCode>AUD</currencyCode>
<Latitude>-33.691192000000</Latitude>
<Longitude>115.061938999999</Longitude>
<NumberOfReviews>0</NumberOfReviews>
<ConsumerRating>0</ConsumerRating>
<PropertyType>3</PropertyType>
<ChainID>0</ChainID>
<Facilities>3|6|13|14|21|22|28|39|40|41|51|53|54|56|57|58|65|66|141|191|202|204|209|210|211|292|</Facilities>
</hotel>
<hotel>
<hotelId>1491947</hotelId>
<hotelFileName>1_Melrose_Blvd</hotelFileName>
<hotelName>#1 Melrose Blvd</hotelName>
<rating>5</rating>
<cityId>964</cityId>
<cityFileName>Johannesburg</cityFileName>
<cityName>Johannesburg</cityName>
<stateId/>
<stateFileName/>
<stateName/>
<countryCode>ZA</countryCode>
<countryFileName>South_Africa</countryFileName>
<countryName>South Africa</countryName>
<imageId>46777171</imageId>
<Address>1 Melrose Boulevard Melrose Arch</Address>
<minRate/>
<currencyCode>ZAR</currencyCode>
<Latitude>-26.135656000000</Latitude>
<Longitude>28.067751000000</Longitude>
<NumberOfReviews>0</NumberOfReviews>
<ConsumerRating>0</ConsumerRating>
<PropertyType>9</PropertyType>
<ChainID>0</ChainID>
<Facilities>6|7|9|11|12|15|17|18|21|32|34|39|41|42|50|51|56|58|60|140|173|202|293|296|</Facilities>
</hotel>
<hotel>
<hotelId>1726938</hotelId>
<hotelFileName>1_Value_Inn_Clovis</hotelFileName>
<hotelName>#1 Value Inn Clovis</hotelName>
<rating>2</rating>
<cityId>28538</cityId>
<cityFileName>Clovis_New_Mexico</cityFileName>
<cityName>Clovis (New Mexico)</cityName>
<stateId>32</stateId>
<stateFileName>New_Mexico</stateFileName>
<stateName>New Mexico</stateName>
<countryCode>US</countryCode>
<countryFileName>United_States</countryFileName>
<countryName>United States</countryName>
<imageId/>
<Address>1720 Mabry</Address>
<minRate/>
<currencyCode>USD</currencyCode>
<Latitude>34.396549224853</Latitude>
<Longitude>-103.182769775390</Longitude>
<NumberOfReviews>0</NumberOfReviews>
<ConsumerRating>0</ConsumerRating>
<PropertyType>2</PropertyType>
<ChainID>0</ChainID>
<Facilities>6|7|8|18|21|22|27|41|50|52|56|222|281|292|</Facilities>
</hotel>
</hotels>
XML
class Hotel
  include SAXMachine
  element :hotelId, :as => :id
  element :hotelName, :as => :name
end

class Wikihandler
  include SAXMachine
  elements :hotel, :as => :hotels, :class => Hotel
end

describe Wikihandler do
  before(:all) do
    @parser = Wikihandler.new
    @parser.parse XML
  end

  it "should parse the proper number of hotels" do
    expect(@parser.hotels.count).to eq 4
  end

  it "should parse the hotel id of each entry" do
    expect(@parser.hotels[0].id).to eq "1568054"
  end

  it "should parse the hotel name of each entry" do
    expect(@parser.hotels[0].name).to eq '"Der Obere Wirt" zum Queri'
  end
end