Nokogiri gem does not parse a file with a SAX handler



I have an XML file with the following header:

<?xml version="1.0" encoding="utf-16"?>

It also contains:

<transmission xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">

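When I use the SAX parser, the file does not parse. If I manually remove the encoding part of the declaration and the attributes of the transmission element, the XML parses successfully. Because the file is large, I can only use SAX. Is there any other way to parse this XML file without manually removing the encoding and the transmission attributes?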

The sample code is:

    require 'nokogiri'
    include Nokogiri

    class P < Nokogiri::XML::SAX::Document
      def initialize
      end

      def start_element(element, attributes = [])
        puts element
      end

      def cdata_block(string)
      end

      def characters(string)
      end

      def end_element(element)
        puts element
      end
    end

    parser = Nokogiri::XML::SAX::Parser.new(P.new)
    parser.parse_file('file_dummy.xml')

Try implementing the full suite of SAX callback methods and see what you get:

require 'nokogiri'
class MyDoc < Nokogiri::XML::SAX::Document
  def cdata_block(str)
    puts "cdata_block: #{str}"
  end
  def characters(str)
    puts "characters: #{str}"
  end
  def comment(str)
    puts "comment: #{str}"
  end
  def end_element(str)
    puts "end_element: #{str}"
  end
  def end_document
    puts "end_document"
  end
  def end_element_namespace(name, prefix = nil, uri = nil)
    puts "end_element_namespace: name: #{name} prefix: #{prefix} uri: #{uri}"
  end
  def error(str)
    puts "error:#{str}"
  end
  def processing_instruction(name, content)
    puts "processing_instruction: name: #{name} content: #{content}"
  end
  def start_document
    puts "start_document"
  end
  def start_element(str, attrs = [])
    puts "start_element: #{str} attrs: #{attrs}"
  end
  def start_element_namespace(name, attrs=[], prefix=nil, uri=nil, ns=[])
    puts "start_element_namespace: name: #{name} attrs: #{attrs} prefix: #{prefix} uri: #{uri} ns: #{ns}"
  end
  def warning(str)
    puts "warning: #{str}"
  end
  def xmldecl(version, encoding, standalone)
    puts "xmldecl: version: #{version} encoding: #{encoding} standalone: #{standalone}"
  end
end
parser = Nokogiri::XML::SAX::Parser.new(MyDoc.new)
parser.parse(File.open(ARGV[0]))

Save that to a script and run it with:

ruby path/to/script.rb path/to/file.xml

You should see output. For example, using the following as a simple XML file:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
</catalog>

I get the following output:

xmldecl: version: 1.0 encoding:  standalone:
start_document
start_element_namespace: name: catalog attrs: [] prefix:  uri:  ns: []
characters:
start_element_namespace: name: book attrs: [#<struct Nokogiri::XML::SAX::Parser::Attribute localname="id", prefix=nil, uri=nil, value="bk101">] prefix:  uri:  ns: []
characters:
start_element_namespace: name: author attrs: [] prefix:  uri:  ns: []
characters: Gambardella, Matthew
end_element_namespace: name: author prefix:  uri:
characters:
start_element_namespace: name: title attrs: [] prefix:  uri:  ns: []
characters: XML Developer's Guide
end_element_namespace: name: title prefix:  uri:
characters:
start_element_namespace: name: genre attrs: [] prefix:  uri:  ns: []
characters: Computer
end_element_namespace: name: genre prefix:  uri:
characters:
start_element_namespace: name: price attrs: [] prefix:  uri:  ns: []
characters: 44.95
end_element_namespace: name: price prefix:  uri:
characters:
start_element_namespace: name: publish_date attrs: [] prefix:  uri:  ns: []
characters: 2000-10-01
end_element_namespace: name: publish_date prefix:  uri:
characters:
start_element_namespace: name: description attrs: [] prefix:  uri:  ns: []
characters: An in-depth look at creating applications
      with XML.
end_element_namespace: name: description prefix:  uri:
characters:
end_element_namespace: name: book prefix:  uri:
characters:
end_element_namespace: name: catalog prefix:  uri:
end_document
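
If the same script prints only `error:` lines for the problem file, it may help to compare what the declaration claims with what the bytes look like. The thread never shows the parser's actual error, so treat this as a guess: a file that declares `utf-16` but carries no UTF-16 byte order mark may really be UTF-8 or ASCII. The sketch below uses only the Ruby standard library; `inspect_xml_encoding` is a hypothetical helper name, and it assumes the declaration sits within the first 200 bytes.

```ruby
# Hypothetical helper: report the encoding named in the XML declaration
# and any byte order mark found at the start of the file.
def inspect_xml_encoding(path)
  # Read only the head of the file, as raw bytes; nil means an empty file.
  head = File.binread(path, 200) || "".b

  bom =
    if head.start_with?("\xFF\xFE".b)
      "UTF-16LE"
    elsif head.start_with?("\xFE\xFF".b)
      "UTF-16BE"
    elsif head.start_with?("\xEF\xBB\xBF".b)
      "UTF-8"
    end

  # Drop NUL bytes so the declaration is readable even in real UTF-16 data.
  ascii = head.delete("\x00".b)
  declared = ascii[/encoding=["']([A-Za-z0-9._-]+)["']/, 1]

  { declared: declared, bom: bom }
end

if __FILE__ == $PROGRAM_NAME && ARGV[0]
  p inspect_xml_encoding(ARGV[0])
end
```

A result with `declared` set to `utf-16` and `bom` set to `nil` would be consistent with a mislabeled file, although UTF-16 data without a byte order mark also exists, so this is a hint rather than proof.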

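After working through many references, I found the answer. It is based on @thetinman's answer, though I did not use it as-is. I used a sed command to replace utf-16 with utf-8 in the file and then parsed it. The sed step is needed because Nokogiri has a problem with the utf-16 declaration.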
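
The sed step can also be done in Ruby, which keeps the whole workflow in one script. This is only a sketch: `rewrite_declared_encoding` is a hypothetical helper, and it assumes the file's bytes are already UTF-8 or ASCII, that only the declaration is wrong, and that the declaration sits within the first 200 bytes. The file is copied in chunks so a large input is never loaded into memory at once.

```ruby
# Hypothetical helper: copy src to dst, changing only the encoding named
# in the XML declaration. The rest of the file is copied byte for byte.
def rewrite_declared_encoding(src, dst, from: "utf-16", to: "utf-8")
  pattern = /encoding=(["'])#{Regexp.escape(from)}\1/i

  File.open(src, "rb") do |input|
    File.open(dst, "wb") do |output|
      # Only the head is searched, so later text containing "utf-16"
      # is left untouched.
      head = input.read(200)
      if head
        output.write(head.sub(pattern) { "encoding=#{$1}#{to}#{$1}" })
      end

      # Stream the remainder in fixed-size chunks.
      while (chunk = input.read(64 * 1024))
        output.write(chunk)
      end
    end
  end

  dst
end
```

The rewritten copy can then be passed to the parser from the question, for example `Nokogiri::XML::SAX::Parser.new(P.new).parse_file('file_fixed.xml')`, where `file_fixed.xml` is a placeholder name for the output path.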
