Why is Nokogiri truncating this element?



I'm parsing an XML file with Nokogiri and Ruby 1.9.2. Everything seemed fine until I read the Descriptions (below). The text is being truncated. The input text is:

<Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.
There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.
The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value>

But what I get is:

g. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.

Notice that it starts at "g." and that more than half of the text has been cut off.

Here is the complete XML file:

<?xml version="1.0" encoding="utf-8"?>
<Hotel>
  <HotelID>1040900</HotelID>
  <HotelFileName>Copthorne_Hotel_Aberdeen</HotelFileName>
  <HotelName>Copthorne Hotel Aberdeen</HotelName>
  <CityID>10</CityID>
  <CityFileName>Aberdeen</CityFileName>
  <CityName>Aberdeen</CityName>
  <CountryCode>GB</CountryCode>
  <CountryFileName>United_Kingdom</CountryFileName>
  <CountryName>United Kingdom</CountryName>
  <StarRating>4</StarRating>
  <Latitude>57.146068572998</Latitude>
  <Longitude>-2.111680030823</Longitude>
  <Popularity>1</Popularity>
  <Address>122 Huntly Street</Address>
  <CurrencyCode>GBP</CurrencyCode>
  <LowRate>36.8354</LowRate>
  <Facilities>1|2|3|5|6|8|10|11|15|17|18|19|20|22|27|29|30|34|36|39|40|41|43|45|47|49|51|53|55|56|60|62|140|154|209</Facilities>
  <NumberOfReviews>239</NumberOfReviews>
  <OverallRating>3.95</OverallRating>
  <CleanlinessRating>3.98</CleanlinessRating>
  <ServiceRating>3.98</ServiceRating>
  <FacilitiesRating>3.83</FacilitiesRating>
  <LocationRating>4.06</LocationRating>
  <DiningRating>3.93</DiningRating>
  <RoomsRating>3.68</RoomsRating>
  <PropertyType>0</PropertyType>
  <ChainID>92</ChainID>
  <Checkin>14</Checkin>
  <Checkout>12</Checkout>
  <Images>
    <Image>19305754</Image>
    <Image>19305755</Image>
    <Image>19305756</Image>
    <Image>19305757</Image>
    <Image>19305758</Image>
    <Image>19305759</Image>
    <Image>19305760</Image>
    <Image>19305761</Image>
    <Image>19305762</Image>
    <Image>19305763</Image>
    <Image>19305764</Image>
    <Image>19305765</Image>
    <Image>19305766</Image>
    <Image>19305767</Image>
    <Image>37102984</Image>
  </Images>
  <Descriptions>
    <Description>
      <Name>General Description</Name>
      <Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.
There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.
The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value>
    </Description>
    <Description>
      <Name>LocationDescription</Name>
      <Value>Aberdeen's premier four star hotel located in the city centre just off Union Street and the main business and entertainment areas. Within 10 minutes journey of Aberdeen Railway Station and only 10-20 minutes journey from International Airport.</Value>
    </Description>
  </Descriptions>
</Hotel>

Here is my Ruby program:

require 'rubygems'
require 'nokogiri'
require 'ap'
include Nokogiri
class Hotel < Nokogiri::XML::SAX::Document
  def initialize
    @h = {}
    @h["Images"] = Array.new([])
    @h["Descriptions"] = Array.new([])
    @desc = {}
  end

  def end_document
    ap @h
    puts "Finished..."
  end

  def start_element(element, attributes = [])
    @element = element
    @desc = {} if element == "Description"
  end

  def end_element(element, attributes = [])
    @h["Images"] << @characters if element == "Image"
    @desc["Name"] = @characters if element == "Name"
    if element == "Value"
      @desc["Value"] = @characters
      @h["Descriptions"] << @desc
    end
    @h[element] = @characters unless %w(Images Image Descriptions Description Hotel Name Value).include? element
  end

  def characters(string)
    @characters = string
  end
end
# Create a new parser
parser = Nokogiri::XML::SAX::Parser.new(Hotel.new)
# Feed the parser some XML
parser.parse(File.open("/Users/cbmeeks/Projects/shared/data/text/HotelDatabase_EN/00/1040900.xml", 'rb'))

Thanks.

I trimmed the XML because it has a lot of unnecessary nodes. Here's a sample of how I'd go after the text:

#!/usr/bin/env ruby
# encoding: UTF-8
xml =<<EOT
<?xml version="1.0" encoding="utf-8"?>
<Hotel>
  <Descriptions>
    <Description>
      <Name>General Description</Name>
      <Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.
There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.
The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value>
    </Description>
    <Description>
      <Name>LocationDescription</Name>
      <Value>Aberdeen's premier four star hotel located in the city centre just off Union Street and the main business and entertainment areas. Within 10 minutes journey of Aberdeen Railway Station and only 10-20 minutes journey from International Airport.</Value>
    </Description>
  </Descriptions>
</Hotel>
EOT
require 'nokogiri'
doc = Nokogiri::XML(xml)
puts doc.search('Value').map{ |n| n.text }

Sample output:

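The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.
There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.
The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.
Aberdeen's premier four star hotel located in the city centre just off Union Street and the main business and entertainment areas. Within 10 minutes journey of Aberdeen Railway Station and only 10-20 minutes journey from International Airport.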

This deliberately goes after the Value nodes only. Modifying the sample to grab the Image nodes as well would be simple.

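Now, a couple of questions: Why are you using SAX mode? Is the incoming XML bigger than what will fit in the host's RAM? If not, use the DOM, because it's much easier to work with.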

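When I first ran it, Ruby told me "invalid multibyte char (US-ASCII)", which means there was something in the XML it didn't like. I fixed that by adding the # encoding line. I'm using Ruby 1.9.2, which makes dealing with such things easier.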
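
The character Ruby most likely tripped over is the curly apostrophe in "city’s", which takes more than one byte in UTF-8. A small sketch of what the magic comment does, assuming the script is saved as UTF-8 (on Ruby 2.0 and later the source encoding already defaults to UTF-8, so the comment is only strictly needed on 1.9):

```ruby
# encoding: UTF-8
# The magic comment above tells Ruby 1.9 to read string literals in this
# file as UTF-8 instead of US-ASCII.

text = "the city’s West End"

puts text.encoding   # encoding Ruby assigned to the literal
puts text.length     # number of characters
puts text.bytesize   # number of bytes; larger because ’ is multibyte
```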

I'm using CSS accessors for the search. Nokogiri supports both XPath and CSS, so you can parse your XML whichever way you prefer.

I ran into a similar problem, and here is the actual explanation. This method:

def characters(string)
    @characters = string
end

should actually look something like this:

def start_element(element, attributes = [])     
  #...(other stuff)...
  # Reset/initialize @characters
  @characters = ""
end
def characters(string)
    @characters += string
end

The rationale is that the contents of a tag can actually be split across multiple text nodes, as described here: http://nokogiri.org/Nokogiri/XML/SAX/Document.html

This method might be called multiple times given one contiguous string of characters.

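Only the last part of the text body was captured because each time the parser encountered a text node (i.e. each time the characters method was called), it replaced the contents of @characters instead of appending to them.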
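
The difference between assigning and appending can be seen without Nokogiri at all. The sketch below simulates the callbacks with two made-up chunks; the split point is hypothetical, since the parser is free to break a text run anywhere.

```ruby
# Hypothetical chunks standing in for two successive characters() calls.
chunks = [
  "The Hotel Copthorne Aberdeen offers its guests a restaurant where they " \
  "can enjoy their meals in a somewhat formal settin",
  "g. For something more laid-back, guests may have a drink and a light " \
  "meal in the hotel bar."
]

replaced = nil   # mimics the question's handler: @characters = string
appended = ''    # mimics the fixed handler:      @characters += string

chunks.each do |chunk|
  replaced = chunk
  appended += chunk
end

puts replaced   # only the final chunk survives
puts appended   # the whole text run
```

With assignment, only the final chunk is left, which is the kind of result the question shows: text beginning at "g.".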
