I have a document that looks like this (note the title):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<title>Sã�ng Title</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<div id="container">
Some Text
</div>
</body>
</html>
When I fetch this document with Nokogiri using the following code:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open(url).read)
The result from Nokogiri is:
ruby-1.9.2-p290 :060 > pp doc
#(Document:0x82e5ed2c {
name = "document",
children = [
#(DTD:0x82e5e994 { name = "HTML" }),
#(Element:0x82e5e0c0 {
name = "html",
attributes = [
#(Attr:0x82e5e05c {
name = "xmlns",
value = "http://www.w3.org/1999/xhtml"
}),
#(Attr:0x82e5e048 {
name = "xmlns:fb",
value = "http://www.facebook.com/2008/fbml"
})],
children = [
#(Element:0x82e5d8dc {
name = "head",
children = [
#(Element:0x82e5d6d4 {
name = "title",
children = [ #(Text "Sã")]
})]
})]
})]
})
To me it looks like Nokogiri simply chokes after the characters "Sã" and assumes the document has ended. As you can see, the #container div isn't included at all.
Does anyone know how to handle this situation? It's driving me crazy… Thanks!
Edit: After some further research I found that the character actually causing the choke is a Unicode null character, "\u0000".
Now I'm thinking I can do something like this:
page_content = open(url).read
# Remove null character
page_content.gsub!(/\u0000/, '')
Nokogiri::HTML(page_content)
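For a more general cleanup (a sketch assuming Ruby 2.1+, where String#scrub exists; the sample string below is hypothetical), you could drop any invalid UTF-8 byte sequences and then strip the NUL before parsing:

```ruby
# Hypothetical page content: contains a stray invalid byte (\xC3) and a NUL
page_content = "Title S\u00E3\u0000\xC3ng"

clean = page_content.scrub("")   # drop invalid UTF-8 byte sequences
clean = clean.delete("\u0000")   # strip NUL characters (valid UTF-8, but the cause of the truncation above)
# doc = Nokogiri::HTML(clean)    # then parse as before
```

Scrubbing first matters: `scrub` is defined on broken strings, while character-wise methods like `delete` can raise on invalid encoding.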
Are you sure the character following "Sã" is a valid UTF-8 character?
Added: There is an illegal UTF-8 byte sequence present. To decode UTF-8 by hand, try this decoder. You can enter the hex of your input and it will tell you what each byte means.
Also useful: a good overview of UTF-8, and the UTF-8 code charts.
Re: removing the null character. Your code looks fine, try it! Beyond that, though, I would investigate where the null in the incoming data stream is coming from.
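One way to do that investigation (a hypothetical sketch using a local sample string in place of your real response body) is to scan the raw bytes for NULs and report their offsets:

```ruby
# Hypothetical raw response; in practice this would be open(url).read
raw = "ab\u0000cd\u0000"

# Collect the byte offset of every NUL in the stream
nul_offsets = raw.each_byte.each_with_index.select { |byte, _| byte.zero? }.map(&:last)
puts nul_offsets.empty? ? "no NUL bytes" : "NUL bytes at offsets: #{nul_offsets.join(', ')}"
```

Knowing whether the NULs appear at a fixed position (e.g. inside one tag) or scattered throughout can point to the upstream bug that produced them.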
Also, the binary UTF-8 in your original post is actually the replacement symbol for an unknown character, not your original data stream. This is from your post:
53 C3 A3 EF BF BD 6E 67
Here is the decoding:
U+0053 LATIN CAPITAL LETTER S character
U+00E3 LATIN SMALL LETTER A WITH TILDE character (ã)
U+FFFD REPLACEMENT CHARACTER character (�) # this is the char used when
# the orig is not understood.
U+006E LATIN SMALL LETTER N character
U+0067 LATIN SMALL LETTER G character
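That decoding can be reproduced in Ruby by iterating over the characters of the posted byte sequence and printing their code points:

```ruby
# The eight bytes from the post, interpreted as a UTF-8 string
bytes = "\x53\xC3\xA3\xEF\xBF\xBD\x6E\x67"

# Prints one line per decoded character: U+0053, U+00E3, U+FFFD, U+006E, U+0067
bytes.each_char do |ch|
  printf("U+%04X %s\n", ch.ord, ch)
end
```

Note the three bytes EF BF BD decode to the single code point U+FFFD, which is why five characters come out of eight bytes.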