How to convert between a Unicode/UCS code point and a UTF-16 surrogate pair



How do I convert back and forth between a Unicode/UCS code point and a UTF-16 surrogate pair in C++14 and later?

EDIT: Removed the mention of UCS-2 surrogates, as there is no such thing. Thanks @remy-lebeau!

The surrogate-pairs tag info page explains it (better than the Unicode Standard 9.0 does in §3.9, Table 3-5):

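Unicode characters outside the Basic Multilingual Plane, that is, characters with a code above 0xFFFF, are encoded in UTF-16 as pairs of 16-bit code units called surrogate pairs, by the following scheme: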

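  • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF;
  • the top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit, or high surrogate, which will be in the range 0xD800..0xDBFF;
  • the low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit, or low surrogate, which will be in the range 0xDC00..0xDFFF.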

In C++14 and later, this can be written as:

#include <cstdint>
using codepoint = std::uint32_t;
using utf16 = std::uint16_t;
struct surrogate {
    utf16 high; // Leading
    utf16 low;  // Trailing
};
constexpr surrogate split(codepoint const in) noexcept {
    auto const inMinus0x10000 = (in - 0x10000);
    surrogate const r{
            static_cast<utf16>((inMinus0x10000 / 0x400) + 0xd800), // High
            static_cast<utf16>((inMinus0x10000 % 0x400) + 0xdc00)}; // Low
    return r;
}
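As a quick sanity check of the scheme, the sketch below works through the three steps by hand for U+1F600. The code point and the variable names are illustrative choices, not part of the original answer. It uses shifts and masks, which for unsigned values give the same result as the division and remainder by 0x400 used in split().

```cpp
#include <cstdint>

// Step 1: subtract 0x10000, leaving a 20-bit value.
constexpr std::uint32_t exampleCodePoint = 0x1F600;
constexpr std::uint32_t exampleOffset = exampleCodePoint - 0x10000;

// Step 2: top ten bits + 0xD800 gives the high (leading) surrogate.
constexpr std::uint16_t exampleHigh =
    static_cast<std::uint16_t>(0xD800 + (exampleOffset >> 10));

// Step 3: low ten bits + 0xDC00 gives the low (trailing) surrogate.
constexpr std::uint16_t exampleLow =
    static_cast<std::uint16_t>(0xDC00 + (exampleOffset & 0x3FF));

static_assert(exampleOffset == 0x0F600, "20-bit offset");
static_assert(exampleHigh == 0xD83D, "high surrogate of U+1F600");
static_assert(exampleLow == 0xDE00, "low surrogate of U+1F600");
```

Because everything is constexpr, a mistake in the arithmetic would show up as a compile error rather than at run time.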

In the reverse direction, just combine the last 10 bits of the high surrogate with the last 10 bits of the low surrogate, and add 0x10000:

constexpr codepoint combine(surrogate const s) noexcept {
    return static_cast<codepoint>(
            ((s.high - 0xd800) * 0x400) + (s.low - 0xdc00) + 0x10000);
}
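The "last 10 bits" wording can also be read literally as masks and shifts. The following is a minimal sketch of that reading; combineUnits is a hypothetical helper introduced here for illustration, and it assumes both inputs are already known to be a valid high/low surrogate pair.

```cpp
#include <cstdint>

// Hypothetical helper: same arithmetic as combine(), written with bit operations.
// Assumes high is in 0xD800..0xDBFF and low is in 0xDC00..0xDFFF.
constexpr std::uint32_t combineUnits(std::uint16_t high, std::uint16_t low) noexcept {
    return ((static_cast<std::uint32_t>(high & 0x3FF) << 10)   // last 10 bits of high
            | static_cast<std::uint32_t>(low & 0x3FF))         // last 10 bits of low
           + 0x10000;                                          // undo the initial subtraction
}

static_assert(combineUnits(0xD83D, 0xDE00) == 0x1F600, "U+1F600");
static_assert(combineUnits(0xD800, 0xDC00) == 0x10000, "first supplementary code point");
static_assert(combineUnits(0xDBFF, 0xDFFF) == 0x10FFFF, "last valid code point");
```

The two boundary cases show that the surrogate ranges map exactly onto 0x10000..0x10FFFF.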

Here is a test for these conversions:

#include <cassert>
constexpr bool isValidUtf16Surrogate(utf16 v) noexcept
{ return (v & 0xf800) == 0xd800; }
constexpr bool isValidCodePoint(codepoint v) noexcept {
    return (v <= 0x10ffff)
        && ((v >= 0x10000) || !isValidUtf16Surrogate(static_cast<utf16>(v)));
}
constexpr bool isValidUtf16HighSurrogate(utf16 v) noexcept
{ return (v & 0xfc00) == 0xd800; }
constexpr bool isValidUtf16LowSurrogate(utf16 v) noexcept
{ return (v & 0xfc00) == 0xdc00; }
constexpr bool codePointNeedsUtf16Surrogates(codepoint v) noexcept
{ return (v >= 0x10000) && (v <= 0x10ffff); }
void test(codepoint const in) {
    assert(isValidCodePoint(in));
    assert(codePointNeedsUtf16Surrogates(in));
    auto const s = split(in);
    assert(isValidUtf16HighSurrogate(s.high));
    assert(isValidUtf16LowSurrogate(s.low));
    auto const out = combine(s);
    assert(isValidCodePoint(out));
    assert(in == out);
}
int main() {
    for (codepoint c = 0x10000; c <= 0x10ffff; ++c)
        test(c);
}
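In practice these checks are usually applied while converting whole strings, where code points below 0x10000 take one code unit and the rest take two. The following is a rough sketch of that, not part of the original answer: encodeUtf16, decodeUtf16 and exampleRoundTrip are hypothetical names, and replacing invalid input with U+FFFD is an assumption (reporting an error would be just as reasonable).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encode code points as UTF-16 code units.
// Assumption: surrogate code points and values above 0x10FFFF become U+FFFD.
std::u16string encodeUtf16(const std::u32string &in) {
    std::u16string out;
    for (char32_t c : in) {
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            c = 0xFFFD;
        if (c < 0x10000) {
            out.push_back(static_cast<char16_t>(c));  // BMP: one code unit
        } else {
            char32_t const v = c - 0x10000;           // 20-bit offset
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical helper: decode UTF-16 code units into code points.
// Assumption: an unpaired surrogate becomes U+FFFD.
std::u32string decodeUtf16(const std::u16string &in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t const u = in[i];
        bool const isHigh = (u & 0xFC00) == 0xD800;
        bool const nextIsLow =
            i + 1 < in.size() && (in[i + 1] & 0xFC00) == 0xDC00;
        if (isHigh && nextIsLow) {
            char32_t const lo = in[++i];              // consume the low surrogate
            out.push_back(((u - 0xD800) << 10) + (lo - 0xDC00) + 0x10000);
        } else if ((u & 0xF800) == 0xD800) {
            out.push_back(0xFFFD);                    // unpaired surrogate
        } else {
            out.push_back(u);                         // ordinary BMP code unit
        }
    }
    return out;
}

// Usage: round-trips a string with one BMP and one non-BMP character.
inline void exampleRoundTrip() {
    std::u32string const text = U"A\U0001F600";
    std::u16string const units = encodeUtf16(text);
    assert(units.size() == 3);                        // 'A' plus one surrogate pair
    assert(decodeUtf16(units) == text);
}
```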

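In C++11 and later, you can use std::wstring_convert to convert between various UTF/UCS encodings (note that std::wstring_convert and the <codecvt> facets are deprecated as of C++17), using the following std::codecvt types: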

  • UTF-8 <-> UCS-2:
    std::codecvt_utf8<char16_t>

  • UTF-8 <-> UTF-16:
    std::codecvt_utf8_utf16<char16_t>

  • UTF-8 <-> UTF-32/UCS-4:
    std::codecvt_utf8<char32_t>

  • UCS-2 <-> UTF-16:
    std::codecvt_utf16<char16_t>

  • UTF-16 <-> UTF-32/UCS-4:
    std::codecvt_utf16<char32_t>

  • UCS-2 <-> UTF-32/UCS-4:
    There is no standard conversion for this, but you can write your own std::codecvt class for it if needed. Otherwise, chain two of the conversions above:
    UCS-2 <-> UTF-X <-> UTF-32/UCS-4
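To show how one of the facets listed above plugs into std::wstring_convert, here is a minimal sketch for the UTF-8 <-> UTF-16 case. The alias and function names (Utf8Utf16, utf8ToUtf16, utf16ToUtf8, exampleUtf8RoundTrip) are made up for this example, and since these facilities are deprecated from C++17 onward, a compiler may emit deprecation warnings for it.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// The wide element type (char16_t) matches the facet's internal type.
using Utf8Utf16 =
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>;

// from_bytes() goes from the byte encoding (UTF-8) to the wide one (UTF-16).
std::u16string utf8ToUtf16(const std::string &utf8) {
    return Utf8Utf16{}.from_bytes(utf8);
}

// to_bytes() goes the other way. Both throw std::range_error on bad input.
std::string utf16ToUtf8(const std::u16string &utf16) {
    return Utf8Utf16{}.to_bytes(utf16);
}

// Usage: U+1F600 is four bytes in UTF-8 and one surrogate pair in UTF-16.
inline void exampleUtf8RoundTrip() {
    std::string const utf8 = "\xF0\x9F\x98\x80";
    std::u16string const utf16 = utf8ToUtf16(utf8);
    assert(utf16.size() == 2);
    assert(utf16[0] == 0xD83D && utf16[1] == 0xDE00);
    assert(utf16ToUtf8(utf16) == utf8);
}
```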

You don't need to handle surrogates manually.

You can use std::u32string to hold your code point(s), and std::u16string to hold your UTF-16/UCS-2 code units.

For example:

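#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// The wide element type must match the facet's internal type (char32_t).
// The byte side of the conversion is UTF-16, big-endian by default, no BOM.
using convert_utf16_utf32 = std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t>;

// Helper: pack big-endian UTF-16 bytes into 16-bit code units.
std::u16string BytesToUTF16(const std::string &bytes)
{
    std::u16string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
        out.push_back(static_cast<char16_t>(
            (static_cast<unsigned char>(bytes[i]) << 8) |
             static_cast<unsigned char>(bytes[i + 1])));
    return out;
}

// Helper: unpack 16-bit code units into big-endian UTF-16 bytes.
std::string UTF16ToBytes(const std::u16string &str)
{
    std::string bytes;
    for (char16_t c : str) {
        bytes.push_back(static_cast<char>(c >> 8));
        bytes.push_back(static_cast<char>(c & 0xff));
    }
    return bytes;
}

std::u16string UTF32toUTF16(const std::u32string &str)
{
    return BytesToUTF16(convert_utf16_utf32{}.to_bytes(str));
}

std::u16string CodepointToUTF16(const char32_t codepoint)
{
    return UTF32toUTF16(std::u32string(1, codepoint));
}

std::u32string UTF16toUTF32(const std::u16string &str)
{
    return convert_utf16_utf32{}.from_bytes(UTF16ToBytes(str));
}

// Returns the first code point of str, or U+0000 if str is empty.
char32_t UTF16toCodepoint(const std::u16string &str)
{
    std::u32string const s = UTF16toUTF32(str);
    return s.empty() ? U'\0' : s.front();
}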
