Parsing a UTF-8 string of known length in Common Lisp, one byte at a time

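I'm writing a program in Common Lisp for editing Minecraft-generated binary files that use the NBT format, documented here: http://minecraft.gamepedia.com/NBT_format?cookieSetup=true (I'm aware that such tools already exist, for example NBTEditor and MCEdit, but neither is written in Common Lisp, and I thought this project would make a good learning exercise).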

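So far, one of the only things I haven't managed to implement myself is a function for reading a UTF-8 string of known length that contains characters represented by more than one octet (i.e. non-ASCII characters). In the NBT format, every string is UTF-8 encoded and is preceded by a short (two-octet) integer n giving the string's length. So, assuming only ASCII characters are present in the string, I can simply read a sequence of n octets from the stream and convert it to a string like this: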

(defun read-utf-8-string (string-length byte-stream)
  (let ((seq (make-array string-length :element-type '(unsigned-byte 8)
                                       :fill-pointer t)))
    (setf (fill-pointer seq) (read-sequence seq byte-stream))
    (flexi-streams:octets-to-string seq :external-format :utf-8)))

But if one or more characters have a code point greater than 127, each of them is encoded as two or more bytes, as in this example:

(flexi-streams:string-to-octets "wife" :external-format :utf-8)
==> #(119 105 102 101)
(flexi-streams:string-to-octets "жена" :external-format :utf-8)
==> #(208 182 208 181 208 189 208 176)
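The gap between character count and octet count can also be checked without any library. The following is a minimal sketch, not part of the original post: utf-8-octet-count and *russian-word* are hypothetical names introduced only for illustration, and the code assumes an implementation with Unicode character support.

```lisp
(defun utf-8-octet-count (string)
  "Hypothetical helper: number of octets needed to encode STRING as UTF-8."
  (loop for ch across string
        for code = (char-code ch)
        sum (cond ((< code #x80) 1)      ; ASCII range
                  ((< code #x800) 2)     ; e.g. Cyrillic
                  ((< code #x10000) 3)   ; rest of the Basic Multilingual Plane
                  (t 4))))               ; supplementary planes

;; The Russian word is built from code points so that the example does not
;; depend on the encoding of the source file.
(defparameter *russian-word*
  (map 'string #'code-char '(#x436 #x435 #x43D #x430)))

(list (length "wife") (utf-8-octet-count "wife"))                 ; => (4 4)
(list (length *russian-word*) (utf-8-octet-count *russian-word*)) ; => (4 8)
```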

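Both strings have the same length, but each character of the Russian word is encoded with twice as many octets, so the encoded string is twice the size of the English one. Knowing the string length therefore doesn't help if I use read-sequence. Even if the size of the string (i.e. the number of octets needed to encode it) were known, there would still be no way of knowing which of those octets convert to a character on their own and which have to be grouped together for conversion. So, rather than rolling my own function, I tried to find a way to get either the implementation (Clozure CL) or an external library to do the work for me. Unfortunately, this has been problematic too, because my parser relies on using the same file stream for all of the reading functions, like this: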

(with-open-file (stream "test.dat" :direction :input
                                   :element-type '(unsigned-byte 8))
  ;; Read entire contents of NBT file from stream here
  )

This restricts me to using :element-type '(unsigned-byte 8), and therefore prevents me from specifying a character encoding and using read-char (or similar), like this:

(with-open-file (stream "test.dat" :external-format :utf-8)
  ...)

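The :element-type has to be '(unsigned-byte 8) so that I can read and write integers and floats of various sizes. To avoid having to convert octet sequences to strings by hand, I first wondered whether there was a way to change the element type to character while the file is open, which led me to a comp.lang.lisp discussion on Google Groups (thread N0IESNPSPCU/Qmcvtk0HkC0J). Apparently some CL implementations, such as SBCL, use bivalent streams by default, so both read-byte and read-char can be used on the same stream. If I were to take this approach, I would still need to be able to specify an :external-format (:utf-8) for the stream, even though this format should only apply when reading characters, not when reading raw bytes.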

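I've used a couple of functions from flexi-streams in the examples above for brevity, but so far my code uses only the built-in stream types, and I have yet to use flexi-streams themselves. Is this problem a good candidate for flexi-streams? An additional layer of abstraction that lets me read raw bytes and UTF-8 characters interchangeably from the same stream would be ideal.

Any advice from those familiar with flexi-streams (or other relevant approaches) would be much appreciated.

Thanks.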

Here is something I wrote:

First, given the first byte, we want to know how long the encoding of a character actually is.

(defun utf-8-number-of-bytes (first-byte)
  "returns the length of the utf-8 code in number of bytes, based on the first byte.
The length can be a number between 1 and 4."
  (declare (fixnum first-byte))
  (cond ((=       0 (ldb (byte 1 7) first-byte)) 1)
        ((=   #b110 (ldb (byte 3 5) first-byte)) 2)
        ((=  #b1110 (ldb (byte 4 4) first-byte)) 3)
        ((= #b11110 (ldb (byte 5 3) first-byte)) 4)
        (t (error "unknown number of utf-8 bytes for ~a" first-byte))))
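To see why these bit tests identify the length, it can help to print a few octets from the question's example in binary. This is a small illustrative sketch; octet-bits is a hypothetical helper, not part of the answer's code.

```lisp
(defun octet-bits (octet)
  "Hypothetical helper: OCTET as an eight-character binary string."
  (format nil "~8,'0B" octet))

(octet-bits 119) ; => "01110111"  top bit 0: a one-octet character
(octet-bits 208) ; => "11010000"  top bits 110: first octet of a two-octet character
(octet-bits 182) ; => "10110110"  top bits 10: a continuation octet, never a first octet

;; The test used for the two-octet case looks at the top three bits:
(ldb (byte 3 5) 208) ; => 6, i.e. #b110
```

A continuation octet such as 182 matches none of the four patterns, so passing it as the first byte ends in the error branch.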

Then we decode:

(defun utf-8-decode-unicode-character-code-from-stream (stream)
  "Decodes byte values, from a binary byte stream, which describe a character
encoded using UTF-8.
Returns the character code and the number of bytes read."
  (let* ((first-byte (read-byte stream))
         (number-of-bytes (utf-8-number-of-bytes first-byte)))
    (declare (fixnum first-byte number-of-bytes))
    (case number-of-bytes
      (1 (values (ldb (byte 7 0) first-byte)
                 1))
      (2 (values (logior (ash (ldb (byte 5 0) first-byte) 6)
                         (ldb (byte 6 0) (read-byte stream)))
                 2))
      (3 (values (logior (ash (ldb (byte 4 0) first-byte) 12)
                         (ash (ldb (byte 6 0) (read-byte stream)) 6)
                         (ldb (byte 6 0) (read-byte stream)))
                 3))
      (4 (values (logior (ash (ldb (byte 3 0) first-byte) 18)
                         (ash (ldb (byte 6 0) (read-byte stream)) 12)
                         (ash (ldb (byte 6 0) (read-byte stream)) 6)
                         (ldb (byte 6 0) (read-byte stream)))
                 4))
      (t (error "wrong UTF-8 encoding for file position ~a of stream ~s"
                (file-position stream)
                stream)))))
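The same bit manipulation can also be applied to octets that are already in memory, for instance a vector filled by read-sequence as in the question. The following is a minimal sketch under the assumption that the input is well-formed UTF-8; utf-8-octets-to-string is a hypothetical helper, not a library function, and it does not validate continuation octets.

```lisp
(defun utf-8-octets-to-string (octets)
  "Hypothetical helper: decode the vector OCTETS, assumed to be well-formed UTF-8."
  (let ((chars '())
        (i 0)
        (end (length octets)))
    (flet ((continuation ()
             ;; Return the six payload bits of the next octet and advance.
             (prog1 (ldb (byte 6 0) (aref octets i))
               (incf i))))
      (loop while (< i end)
            do (let ((lead (aref octets i)))
                 (incf i)
                 (push (code-char
                        (cond ((< lead #x80) lead)
                              ((= #b110 (ldb (byte 3 5) lead))
                               (logior (ash (ldb (byte 5 0) lead) 6)
                                       (continuation)))
                              ((= #b1110 (ldb (byte 4 4) lead))
                               (let* ((b1 (continuation))
                                      (b2 (continuation)))
                                 (logior (ash (ldb (byte 4 0) lead) 12)
                                         (ash b1 6)
                                         b2)))
                              ((= #b11110 (ldb (byte 5 3) lead))
                               (let* ((b1 (continuation))
                                      (b2 (continuation))
                                      (b3 (continuation)))
                                 (logior (ash (ldb (byte 3 0) lead) 18)
                                         (ash b1 12)
                                         (ash b2 6)
                                         b3)))
                              (t (error "invalid UTF-8 lead octet ~a" lead))))
                       chars))))
    (coerce (nreverse chars) 'string)))

;; The eight octets from the question decode to a four-character string.
(length (utf-8-octets-to-string #(208 182 208 181 208 189 208 176))) ; => 4
(utf-8-octets-to-string #(119 105 102 101))                          ; => "wife"
```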

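You know how many characters there are: N. You can allocate a Unicode-capable string for N characters, then call utf-8-decode-unicode-character-code-from-stream N times. Convert each resulting code to a character and store it in the string.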
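The procedure just described could be sketched as follows. read-utf-8-characters is a hypothetical name, and the optional decoder argument is an assumption added so the loop can be tried without a file; by default it calls the stream decoder defined above.

```lisp
(defun read-utf-8-characters
    (n stream
     &optional (decoder 'utf-8-decode-unicode-character-code-from-stream))
  "Hypothetical helper: read N UTF-8 encoded characters from the binary STREAM.
Returns the string and the total number of octets consumed."
  (let ((string (make-string n))
        (octets 0))
    (dotimes (i n (values string octets))
      ;; DECODER returns the character code and the number of octets it read.
      (multiple-value-bind (code size) (funcall decoder stream)
        (setf (char string i) (code-char code))
        (incf octets size)))))
```

If the length prefix turns out to count octets rather than characters, the second value returned by the decoder makes it possible to loop until that many octets have been consumed instead.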
