Reading a single UTF8 character from a stream in C#

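I want to read the next UTF8 character from a Stream or BinaryReader. Things that don't work:

BinaryReader::ReadChar: this throws on a 3- or 4-byte character. Since it returns a two-byte structure, it has no choice.

BinaryReader::ReadChars: this throws if you ask it to read 1 character and it encounters a 3- or 4-byte character. If you ask it to read more than 1 character, it reads multiple characters.

StreamReader::Read: this needs to know how many bytes to read, but the number of bytes in a UTF8 character is variable.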
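
To illustrate the point about variable length, here is a minimal sketch that reports how many bytes UTF8 needs for a given piece of text. The class and method names are made up for illustration; the only library call is Encoding.UTF8.GetByteCount.

```csharp
using System;
using System.Text;

public static class Utf8ByteCountDemo
{
    // Returns the number of bytes UTF8 needs to encode the given text.
    public static int Utf8Length(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // Prints the encoded length of one character from each UTF8 length class.
    public static void PrintExamples()
    {
        Console.WriteLine(Utf8Length("A"));          // U+0041, 1 byte
        Console.WriteLine(Utf8Length("\u00E9"));     // U+00E9, 2 bytes
        Console.WriteLine(Utf8Length("\u20AC"));     // U+20AC, 3 bytes
        Console.WriteLine(Utf8Length("\U0001F600")); // U+1F600, 4 bytes (a surrogate pair in UTF-16)
    }
}
```

So a reader cannot know up front how many bytes to pull from the stream; it has to look at the first byte, or let a decoder do so.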

The code I have, which seems to work:

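    // Requires: using System.IO; using System.Text;
    // Returns one char, or two chars (a surrogate pair) for a 4-byte sequence.
    // Returns null at end of stream or when the lead byte is not valid.
    private char[] ReadUTF8Char(Stream s)
    {
        byte[] bytes = new byte[4];
        var enc = new UTF8Encoding(false, true); // no BOM, throw on invalid bytes
        if (1 != s.Read(bytes, 0, 1))
            return null;
        if (bytes[0] <= 0x7F) // Single byte character
        {
            return enc.GetChars(bytes, 0, 1);
        }
        else
        {
            // The high bits of the lead byte give the number of continuation bytes.
            var remainingBytes =
                ((bytes[0] & 0xF8) == 0xF0) ? 3 : (
                ((bytes[0] & 0xF0) == 0xE0) ? 2 : (
                ((bytes[0] & 0xE0) == 0xC0) ? 1 : -1
            ));
            if (remainingBytes == -1)
                return null;
            // Stream.Read may return fewer bytes than requested, so loop.
            int offset = 1;
            while (offset <= remainingBytes)
            {
                int n = s.Read(bytes, offset, remainingBytes + 1 - offset);
                if (n == 0)
                    return null; // stream ended in the middle of a character
                offset += n;
            }
            return enc.GetChars(bytes, 0, remainingBytes + 1);
        }
    }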

Obviously, this is a bit messy and rather specific to UTF8. Is there a more elegant, less custom, easier-to-read solution to this problem?

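I know this question is a bit old, but here is another solution. It doesn't perform as well as the OP's solution (which I also prefer), but it uses only the built-in UTF8 functionality, without knowing anything about the internals of the UTF8 encoding.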

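    // Requires: using System; using System.IO; using System.Text;
    // Needs a seekable stream, because it reads ahead and then steps back.
    private static char ReadUTF8Char(Stream s)
    {
        if (s.Position >= s.Length)
            throw new EndOfStreamException("Error: Read beyond EOF");
        // leaveOpen: true keeps the stream open when the reader is disposed.
        using (BinaryReader reader = new BinaryReader(s, Encoding.Unicode, true))
        {
            // Read up to 4 bytes, the longest possible UTF8 sequence.
            int numRead = (int)Math.Min(4L, s.Length - s.Position);
            byte[] bytes = reader.ReadBytes(numRead);
            char[] chars = Encoding.UTF8.GetChars(bytes);
            if (chars.Length == 0)
                throw new InvalidDataException("Error: Invalid UTF8 char");
            // A 4-byte sequence decodes to a surrogate pair, which one char cannot hold.
            if (char.IsSurrogate(chars[0]))
                throw new InvalidDataException("Error: Character does not fit in a single char");
            // Step back over the bytes that belong to later characters.
            int charLen = Encoding.UTF8.GetByteCount(new char[] { chars[0] });
            s.Position += (charLen - numRead);
            return chars[0];
        }
    }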

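The encoding passed to the BinaryReader constructor doesn't matter. I had to use this overload of the constructor to leave the stream open. If you already have a BinaryReader, you can write: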

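    // Requires: using System; using System.IO; using System.Text;
    // The reader's BaseStream must be seekable.
    private static char ReadUTF8Char(BinaryReader reader)
    {
        var s = reader.BaseStream;
        if (s.Position >= s.Length)
            throw new EndOfStreamException("Error: Read beyond EOF");
        // Read up to 4 bytes, the longest possible UTF8 sequence.
        int numRead = (int)Math.Min(4L, s.Length - s.Position);
        byte[] bytes = reader.ReadBytes(numRead);
        char[] chars = Encoding.UTF8.GetChars(bytes);
        if (chars.Length == 0)
            throw new InvalidDataException("Error: Invalid UTF8 char");
        // A 4-byte sequence decodes to a surrogate pair, which one char cannot hold.
        if (char.IsSurrogate(chars[0]))
            throw new InvalidDataException("Error: Character does not fit in a single char");
        // Step back over the bytes that belong to later characters.
        int charLen = Encoding.UTF8.GetByteCount(new char[] { chars[0] });
        s.Position += (charLen - numRead);
        return chars[0];
    }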
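
For what it's worth, another option that stays within the built-in classes is to feed a System.Text.Decoder one byte at a time until it produces output. This is only a sketch and is not taken from the posts above: the class and method names are hypothetical, and it assumes the decoder keeps incomplete sequences buffered between calls when flush is false.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8StreamSketch
{
    // Hypothetical helper: reads bytes until the decoder emits at least one char.
    // Returns null if the stream ends first (including in the middle of a character).
    public static char[] ReadUtf8Char(Stream s, Decoder decoder)
    {
        byte[] one = new byte[1];
        // Room for a surrogate pair, plus slack in case a replacement-fallback
        // decoder emits extra U+FFFD chars for malformed input.
        char[] chars = new char[8];
        while (true)
        {
            int b = s.ReadByte();
            if (b < 0)
                return null;
            one[0] = (byte)b;
            // flush: false tells the decoder to keep partial sequences for the next call.
            int n = decoder.GetChars(one, 0, 1, chars, 0, false);
            if (n > 0)
            {
                char[] result = new char[n];
                Array.Copy(chars, result, n);
                return result;
            }
        }
    }
}
```

Called repeatedly with the same decoder, for example one obtained from new UTF8Encoding(false, true).GetDecoder(), this should give one char for 1- to 3-byte sequences and a surrogate pair for 4-byte ones. It does not need a seekable stream, at the cost of one decoder call per byte.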
