I have a document that looks like this (note the title):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<title>Sã�ng Title</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<div id="container">
Some Text
</div>
</body>
</html>
When I fetch this document with Nokogiri using the following code:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open(url).read)
The result from Nokogiri is:
ruby-1.9.2-p290 :060 > pp doc
#(Document:0x82e5ed2c {
name = "document",
children = [
#(DTD:0x82e5e994 { name = "HTML" }),
#(Element:0x82e5e0c0 {
name = "html",
attributes = [
#(Attr:0x82e5e05c {
name = "xmlns",
value = "http://www.w3.org/1999/xhtml"
}),
#(Attr:0x82e5e048 {
name = "xmlns:fb",
value = "http://www.facebook.com/2008/fbml"
})],
children = [
#(Element:0x82e5d8dc {
name = "head",
children = [
#(Element:0x82e5d6d4 {
name = "title",
children = [ #(Text "Sã")]
})]
})]
})]
})
To me it looks like Nokogiri simply chokes after the characters "Sã" and assumes the document has ended. As you can see, the #container div isn't included at all.
Does anyone know how to handle this situation? It's driving me crazy… Thanks!
Edit: After some further research I found that the character actually causing the choke is a Unicode null character, "\u0000".
Now I'm thinking I can do something like this:
page_content = open(url).read
# Remove null character
page_content.gsub!(/\u0000/, '')
Nokogiri::HTML(page_content)
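For a more general cleanup (a sketch assuming Ruby 2.1+, where String#scrub exists; the sample string below is hypothetical), you could drop any invalid UTF-8 byte sequences and then strip the NUL before parsing:

```ruby
# Hypothetical page content: contains a stray invalid byte (\xC3) and a NUL
page_content = "Title S\u00E3\u0000\xC3ng"

clean = page_content.scrub("")   # drop invalid UTF-8 byte sequences
clean = clean.delete("\u0000")   # strip NUL characters (valid UTF-8, but the cause of the truncation above)
# doc = Nokogiri::HTML(clean)    # then parse as before
```

Scrubbing first matters: `scrub` is defined on broken strings, while character-wise methods like `delete` can raise on invalid encoding.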
Are you sure the character following "Sã" is a valid UTF-8 character?
Added: There is an illegal UTF-8 byte sequence present. To decode UTF-8 by hand, try this decoder. You can enter the hex of your input and it will tell you what each byte means.
Also useful: a good overview of UTF-8, and the UTF-8 code charts.
Re: removing the null character. Your code looks fine, try it! Beyond that, though, I would investigate where the null in the incoming data stream is coming from.
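One way to do that investigation (a hypothetical sketch using a local sample string in place of your real response body) is to scan the raw bytes for NULs and report their offsets:

```ruby
# Hypothetical raw response; in practice this would be open(url).read
raw = "ab\u0000cd\u0000"

# Collect the byte offset of every NUL in the stream
nul_offsets = raw.each_byte.each_with_index.select { |byte, _| byte.zero? }.map(&:last)
puts nul_offsets.empty? ? "no NUL bytes" : "NUL bytes at offsets: #{nul_offsets.join(', ')}"
```

Knowing whether the NULs appear at a fixed position (e.g. inside one tag) or scattered throughout can point to the upstream bug that produced them.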
Also, the binary UTF-8 in your original post is actually the replacement symbol for an unknown character, not your original data stream. This is from your post:
53 C3 A3 EF BF BD 6E 67
Here is the decoding:
U+0053 LATIN CAPITAL LETTER S character
U+00E3 LATIN SMALL LETTER A WITH TILDE character (ã)
U+FFFD REPLACEMENT CHARACTER character (�) # this is the char used when
# the orig is not understood.
U+006E LATIN SMALL LETTER N character
U+0067 LATIN SMALL LETTER G character
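That decoding can be reproduced in Ruby by iterating over the characters of the posted byte sequence and printing their code points:

```ruby
# The eight bytes from the post, interpreted as a UTF-8 string
bytes = "\x53\xC3\xA3\xEF\xBF\xBD\x6E\x67"

# Prints one line per decoded character: U+0053, U+00E3, U+FFFD, U+006E, U+0067
bytes.each_char do |ch|
  printf("U+%04X %s\n", ch.ord, ch)
end
```

Note the three bytes EF BF BD decode to the single code point U+FFFD, which is why five characters come out of eight bytes.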