What causes Python 3.4 bytes.decode() to behave differently between installations?



I am seeing different behavior when decoding a byte string on Python 3.4.3 on two machines: one running OS X, the other running Debian Wheezy.

On OS X:

$ python
Python 3.4.3 (default, Mar 10 2015, 14:53:35) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = b'\xc4\x8dtrn\xc3\xa1ct'
>>> print(s.decode("utf-8"))
čtrnáct

On Debian:

$ python
Python 3.4.3 (default, Apr  4 2015, 22:21:17) 
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> s = b'\xc4\x8dtrn\xc3\xa1ct'
>>> print(s.decode("utf-8"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u010d' in position 0: ordinal not in range(128)

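Something must be configured slightly differently between these two installations to cause this. I checked the default encoding on both and it is the same, but I am not sure what else to check.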

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
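
Note that the traceback is a UnicodeEncodeError, so the decode step itself appears to succeed and the failure happens when print() encodes the text for the terminal. A minimal sketch for comparing the two machines, assuming the relevant setting is the encoding attached to standard output rather than the default string encoding (the helper name io_encoding_report is made up for this illustration):

```python
import locale
import sys


def io_encoding_report():
    """Collect the encodings that can affect print() output.

    io_encoding_report is a hypothetical helper, not part of any library.
    """
    return {
        # Used for str <-> bytes conversions without an explicit codec;
        # always 'utf-8' on Python 3.
        "default": sys.getdefaultencoding(),
        # Used by print() when writing to standard output.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Derived from the locale environment (LC_ALL, LC_CTYPE, LANG).
        "preferred": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in io_encoding_report().items():
        print(name, "=", value)
```

If the "stdout" or "preferred" values differ between the two boxes while "default" matches, the locale environment is the likely culprit.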

Update: the locale command returns different results on the two machines:

OS X:

LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

Debian:

$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE=UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

I found the answer: I followed the "Locales: configuration" section of http://perlgeek.de/en/article/set-up-a-clean-utf8-environment. Specifically, the steps that helped were:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
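
To illustrate why a locale change would plausibly fix this, here is a small sketch that reproduces both outcomes without touching the environment. It writes the decoded text to in-memory streams with an explicit encoding, standing in for a terminal whose encoding was chosen from the locale. The helper name try_print is hypothetical, and the assumption that the broken Debian locale made Python fall back to ASCII for stdout is inferred from the traceback, not confirmed.

```python
import io

raw = b"\xc4\x8dtrn\xc3\xa1ct"
text = raw.decode("utf-8")  # decoding does not depend on the locale


def try_print(value, encoding):
    """Print value to an in-memory stream that uses the given encoding.

    try_print is a hypothetical helper standing in for a terminal whose
    encoding was picked from the locale settings.
    """
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding, newline="\n")
    try:
        print(value, file=stream)
        stream.flush()
    except UnicodeEncodeError as exc:
        return "failed: " + exc.reason
    return buffer.getvalue()


if __name__ == "__main__":
    print(try_print(text, "utf-8"))  # bytes written to a UTF-8 stream
    print(try_print(text, "ascii"))  # same error class as on Debian
```

With a UTF-8 stream the original bytes come back unchanged, while the ASCII stream fails at the first non-ASCII character, just like the Debian session.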
