Django "surrogates not allowed" error on model.save() call when the text contains an emoji character

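We are currently building a system that stores text in a PostgreSQL database via Django. The data is then extracted to ElasticSearch via PGSync.

At the moment we are hitting the following issue in a test case.

Error message: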

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 159-160: surrogates not allowed

We identified the character that causes the issue. It is an emoji.
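For anyone trying to locate the offending characters in their own data, here is a minimal sketch. It relies on the `start`, `end` and `object` attributes that Python sets on `UnicodeEncodeError`. The helper name `find_unencodable` and the sample string are made up for illustration and are not part of the original code.

```python
def find_unencodable(text):
    """Return (start, end, offending_substring) for the first span that
    strict UTF-8 encoding rejects, or None if the text encodes cleanly."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        # exc.start/exc.end delimit the rejected slice (end is exclusive)
        return exc.start, exc.end, exc.object[exc.start:exc.end]
    return None


# Hypothetical sample: two surrogate code points between plain ASCII
sample = "ab\ud83d\ude9bcd"
span = find_unencodable(sample)
if span is not None:
    start, end, bad = span
    print(start, end, [hex(ord(ch)) for ch in bad])
```

The positions in the error message ("position 159-160") are the first and last index of that same slice.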

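The text itself is a mix of Greek characters, English characters and, as it turns out, emoji. The Greek is not displayed as Greek but in \u escape form.

Relevant text that causes the issue: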

\u03bc\u03b5 Some English Text \ud83d\ude9b\n#SomeHashTag

\ud83d\ude9b translates to this emoji: 🚛
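To make that mapping concrete, the sketch below applies the standard UTF-16 surrogate pair arithmetic to the two values. The function name `combine_surrogates` is only an illustrative helper, not something from Django or the standard library.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE9B)
print(hex(code_point))  # 0x1f69b, the code point of the truck emoji
```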

As stated here: https://python-list.python.narkive.com/aKjK4Jje/encoding-of-surrogate-code-points-to-utf-8

The definition of UTF-8 prohibits encoding character numbers
between U+D800 and U+DFFF, which are reserved for use with the
UTF-16 encoding form (as surrogate pairs) and do not directly
represent characters.
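A small sketch of what that rule means in Python 3: the two surrogate code points are rejected by the strict UTF-8 codec, while the single code point they stand for encodes to four bytes. The variable names are illustrative only.

```python
lone_pair = "\ud83d\ude9b"   # two surrogate code points, not a real character
real_char = "\U0001F69B"     # the single code point the pair stands for

try:
    lone_pair.encode("utf-8")
    surrogates_rejected = False
except UnicodeEncodeError:
    # Strict UTF-8 refuses code points in the range U+D800..U+DFFF
    surrogates_rejected = True

encoded = real_char.encode("utf-8")
print(surrogates_rejected, encoded)
```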

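PostgreSQL has the following encoding:

Default: UTF8
Collate: en_US.utf8
Ctype: en_US.utf8

Is this a utf8 issue? Or is it specific to emoji? Is it a Django issue or a PostgreSQL issue?

Reproducing the problem: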

x = '\u03bc\u03b5 Some English Text \ud83d\ude9b\n#SomeHashTag'
print(x)

Traceback (most recent call last):
  File "...", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 21-22: surrogates not allowed

Solution: apply the raw_unicode_escape and unicode_escape codecs as follows (see Python Specific Encodings):

y = x.encode('raw_unicode_escape').decode('unicode_escape').encode('utf-16_BE','surrogatepass').decode('utf-16_BE')
print(y)

με Some English Text 🚛
#SomeHashTag
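If the string already holds the surrogates as code points (as `x` does), the UTF-16 round trip with the `surrogatepass` error handler is the step that merges the pair. Below is a minimal sketch of wrapping just that step in a helper that could be called on the text before model.save(). The name `merge_surrogate_pairs` is hypothetical, and where exactly to call it in a Django project is an assumption left to the reader.

```python
def merge_surrogate_pairs(text):
    """Turn UTF-16 surrogate pairs stored as separate code points into the
    real characters they represent, so the text can be encoded as UTF-8."""
    # 'surrogatepass' lets the encoder emit the surrogates as UTF-16 code
    # units; decoding then reads each high/low pair as one character.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


raw = "\u03bc\u03b5 Some English Text \ud83d\ude9b\n#SomeHashTag"
clean = merge_surrogate_pairs(raw)

# Strict UTF-8 encoding no longer raises for the cleaned text
utf8_bytes = clean.encode("utf-8")
print(len(raw), len(clean))
```

Note that an unpaired surrogate (a high one without a low one, or the reverse) would still make the decode step raise UnicodeDecodeError, so this sketch only covers well-formed pairs.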
