If you look at the output further down, Ruby is stripping all of the HTML entities. How can I parse XML with Nokogiri without losing the HTML entities?
--- BEFORE ---
<blog:entryFull>
&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
--- AFTER ---
<blog:entryFull>
piframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"/iframe/p</blog:entryFull>
</blog:example>
Here is the code:
f = File.open(item)
contents = ""
f.each { |line|
  contents << line
}
puts "--- BEFORE ---"
puts contents
puts "--- AFTER ---"
doc = Nokogiri::XML::DocumentFragment.parse(contents)
puts doc
f.close
Your test file may contain some invalid HTML entities.
nokogiri.rb:
require 'nokogiri'
puts "--- INVALID ---"
invalid_xml = <<-XML
<blog:entryFull>invalid M&Ms</blog:entryFull><!-- invalid M and M's -->
<blog:entryFull>
&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
XML
doc = Nokogiri::XML::DocumentFragment.parse(invalid_xml)
puts doc
puts "--- VALID ---"
valid_xml = <<-XML
<blog:entryFull>valid M&amp;Ms</blog:entryFull><!-- valid M and M's -->
<blog:entryFull>
&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
XML
doc = Nokogiri::XML::DocumentFragment.parse(valid_xml)
puts doc
Result:
$ ruby nokogiri.rb
--- INVALID ---
<blog:entryFull>invalid M</blog:entryFull><!-- invalid M and M's -->
<blog:entryFull>
piframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"/iframe/p</blog:entryFull>
--- VALID ---
<blog:entryFull>valid M&amp;Ms</blog:entryFull><!-- valid M and M's -->
<blog:entryFull>
&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
So, either:
- Fix the input XML
- Use the STRICT ParseOptions
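For the first option, one rough way to repair the input before handing it to Nokogiri is to escape every "&" that does not already start an entity reference. This is only a sketch: escape_stray_ampersands and ENTITY_START are names made up for this example, not part of Nokogiri, and the pattern accepts anything shaped like a named entity (including names such as &nbsp; that plain XML does not define).

```ruby
# Hypothetical helper (not part of Nokogiri): escape any "&" that does not
# already start something shaped like a named or numeric entity reference,
# e.g. &amp; or &#38; or &#x26;.
ENTITY_START = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#\d+|#x\h+);)/

def escape_stray_ampersands(xml)
  xml.gsub(ENTITY_START, '&amp;')
end

# The bare "&" in "M&Ms" gets escaped; an existing "&amp;" is left alone.
puts escape_stray_ampersands('<blog:entryFull>invalid M&Ms</blog:entryFull>')
puts escape_stray_ampersands('<blog:entryFull>valid M&amp;Ms</blog:entryFull>')
```

The repaired string can then be passed to Nokogiri::XML::DocumentFragment.parse as in the script above.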
Strict parsing example:
invalid_xml = <<-XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
<blog:entryFull>invalid M&Ms</blog:entryFull>
<blog:entryFull>
&lt;p&gt;&lt;iframe src="http://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
</root>
XML
begin
  doc = Nokogiri::XML(invalid_xml) do |config|
    config.strict # strict parsing: raise on errors instead of recovering
  end
  puts doc
rescue Nokogiri::XML::SyntaxError => e
  puts "INVALID XML: #{e.message}"
end
Qambar, I was unable to reproduce your problem. However, given these files/inputs, I was able to produce the output you want:
test.xml
<blog:entryFull> &lt;p&gt;&lt;iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true%22" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
nokogiri.rb
require 'nokogiri'
f = File.open("./test.xml")
contents = ""
f.each { |line|
  contents << line
}
puts "--- BEFORE ---"
puts contents
puts "--- AFTER ---"
doc = Nokogiri::XML::DocumentFragment.parse(contents)
puts doc.inner_html
f.close
Console
Development/Code » ruby nokogiri.rb
--- BEFORE ---
<blog:entryFull> &lt;p&gt;&lt;iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true%22" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
--- AFTER ---
<blog:entryFull> &lt;p&gt;&lt;iframe src="https://w.soundcloud.com/player/?url=http%3A%2F%2Fapi.soundcloud.com%2Ftracks%2F39858946&amp;amp;show_artwork=true%22" width="100%" height="166" frameborder="no" scrolling="no"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>
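What I ended up doing was grabbing the XML tag with a regex and then converting the HTML entities using htmlentities. After that I parsed the result with the Nokogiri HTML parser.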
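A rough sketch of that approach, for anyone who wants to try it. This is not the exact code from above: it uses CGI.unescapeHTML from the standard library as a stand-in for htmlentities, the sample input and its example.com URL are made up, and the regex assumes the element has no attributes and is not nested.

```ruby
require 'cgi/escape' # stdlib; provides CGI.unescapeHTML

# Made-up sample input in the same shape as the feed item in the question.
xml = '<blog:entryFull>&lt;p&gt;&lt;iframe src="http://example.com/?a=1&amp;amp;b=2"&gt;&lt;/iframe&gt;&lt;/p&gt;</blog:entryFull>'

# Step 1: pull the element body out with a regex (fragile; assumes no
# attributes on the tag and no nested blog:entryFull elements).
body = xml[%r{<blog:entryFull>(.*?)</blog:entryFull>}m, 1]

# Step 2: decode the entities back into markup (a single decoding pass).
html = CGI.unescapeHTML(body)
puts html

# Step 3 (needs the nokogiri gem, so it is left as a comment here):
#   fragment = Nokogiri::HTML::DocumentFragment.parse(html)
```

Note that only one level of escaping is removed, so the doubly escaped "&amp;amp;" in the URL comes out as "&amp;", which is what an HTML attribute value should contain.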