Write the UTF-8 representation of a Unicode code point to a file

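I have a proprietary file (database) format that I am currently trying to migrate to an SQL database. To do so, I am converting the files to SQL dumps, which already works fine. The only remaining problem is the odd way the format handles characters outside the ASCII decimal range 32 to 126. It keeps a collection of all such characters, stored as Unicode code points (in hex, e.g. 20AC = €) and indexed by its own internal index.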

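My plan now is to create a table that stores the internal index, the Unicode code point (hex), and the character representation (UTF-8). This table can then be used for future updates.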

Now the problem: how do I write the UTF-8 character representation of a Unicode hex value to a file? The current code is as follows:

this->outFile.open(fileName + ".sql", std::ofstream::app);
std::string protyp;
this->inFile.ignore(2); // Ignore the ID = 01.
std::getline(this->inFile, protyp); // Get the PROTYP Identifier (e.g. 321)
protyp = "\\" + protyp;
std::string unicodeHex;
this->inFile.ignore(2); // Ignore the ID = 01.
std::getline(this->inFile, unicodeHex); // Get the Unicode HEX Identifier (e.g. 002C)
std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
const std::wstring wide_string = this->s2ws("\\u" + unicodeHex);
const std::string utf8_rep = converter.to_bytes(wide_string);
std::string valueString = "('" + protyp + "', '" + unicodeHex + "', '" + utf8_rep + "')";
this->outFile << valueString << std::endl;
this->outFile.close();

But this only prints something like this:

('321', '002C', 'u002C'),

whereas the desired output is:

('321', '002C', ','),

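What am I doing wrong? I have to admit I am not very confident when it comes to character encodings and the like :/. In case it makes a difference, I am using Windows 7 64-bit. Thanks in advance.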

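As @Mark Ransom pointed out in the comments, my best option was to convert the hex string to an integer and use that. A \u escape is only interpreted by the compiler inside a source-code literal, so assembling the same characters at run time just yields that literal text. This is what I did: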

unsigned int decimalHex = std::stoul(unicodeHex, nullptr, 16);
std::string valueString = "('" + protyp + "', '" + unicodeHex + "', '" + this->UnicodeToUTF8(decimalHex) + "')";
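The two lines above trust the input completely. A slightly more defensive variant might look like the sketch below. Both `parseCodePoint` and `sqlQuote` are hypothetical helper names that are not part of the original code, and `std::optional` assumes a C++17 compiler. The first rejects malformed hex, surrogates, and values above 0x10FFFF. The second doubles single quotes, which matters for the row whose character is the apostrophe (0027).

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse a hex string such as "20AC" into a code point.
// Returns std::nullopt for malformed input or values that cannot be encoded.
std::optional<unsigned int> parseCodePoint(const std::string& hex)
{
    try
    {
        std::size_t consumed = 0;
        const unsigned long value = std::stoul(hex, &consumed, 16);
        if (consumed != hex.size())
            return std::nullopt; // trailing characters, e.g. a stray '\r'
        if (value > 0x10FFFF)
            return std::nullopt; // beyond the Unicode range
        if (value >= 0xD800 && value <= 0xDFFF)
            return std::nullopt; // surrogates are not valid in UTF-8
        return static_cast<unsigned int>(value);
    }
    catch (const std::invalid_argument&)
    {
        return std::nullopt; // no hex digits at all
    }
    catch (const std::out_of_range&)
    {
        return std::nullopt; // does not fit in unsigned long
    }
}

// Hypothetical helper: wrap text in single quotes, doubling embedded quotes
// as standard SQL string literals require.
std::string sqlQuote(const std::string& text)
{
    std::string out = "'";
    for (char c : text)
    {
        if (c == '\'')
            out += "''";
        else
            out += c;
    }
    out += '\'';
    return out;
}
```

Some servers also treat the backslash as an escape character inside string literals (MySQL does in its default mode), so the quoting rule may need adjusting for the target database.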

The UnicodeToUTF8 function is taken from here: "Unsigned integer as UTF-8 value"

// Encodes a single code point as a UTF-8 byte sequence.
// No range check is performed: surrogates and values above 0x10FFFF are encoded as given.
std::string UnicodeToUTF8(unsigned int codepoint)
{
    std::string out;
    if (codepoint <= 0x7f)
        // 1 byte: 0xxxxxxx (plain ASCII)
        out.append(1, static_cast<char>(codepoint));
    else if (codepoint <= 0x7ff)
    {
        // 2 bytes: 110xxxxx 10xxxxxx
        out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else if (codepoint <= 0xffff)
    {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else
    {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    return out;
}
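Because a console or editor may not display UTF-8 correctly, it can be easier to check the result by looking at the raw bytes. The following is a minimal sketch of such a check; `toHexBytes` is a hypothetical helper name, not something from the original code.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte of s as two uppercase hex digits,
// separated by single spaces, so the encoding can be inspected directly.
std::string toHexBytes(const std::string& s)
{
    std::string out;
    char buf[3]; // two hex digits plus the terminating NUL
    for (unsigned char byte : s)
    {
        if (!out.empty())
            out += ' ';
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned int>(byte));
        out += buf;
    }
    return out;
}
```

For the euro sign from the question, toHexBytes(UnicodeToUTF8(0x20AC)) should give "E2 82 AC", and the comma (0x2C) should give the single byte "2C".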
