UTF-8 encoding does not work for all German characters

I am reading a shapefile with geopandas like this:

import geopandas as gpd

file = gpd.read_file('./County.shp', encoding='utf-8')
file.head()

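In some cases the encoding works well. For example, without the encoding argument the name comes out as GÃ¶ttingen, but with it the name is Göttingen.

However, it does not work in every case. For example, Gebietseinheit Mittelfranken ohne Großstädte is read as b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'

How can I fix this?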

\xdf is ß; likewise, \xe4 is ä:

>>> '\xdf'
'ß'
>>> '\xe4'
'ä'

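So the characters themselves are not wrong.

What actually happens is that the value was read as a bytes string, which is what the b prefix means:

>>> b'\xdf'
b'\xdf'
>>> b'\xe4'
b'\xe4'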

So the underlying values are the same; Python just displays a bytes object differently from a str.
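To make "the same values" concrete, here is a minimal sketch comparing the byte values of the bytes object with the code points of the str. The variable names `raw` and `text` are only illustrative.

```python
raw = b"\xdf\xe4"    # bytes: two raw byte values
text = "\xdf\xe4"    # str: two Unicode code points, shown as 'ßä'

# Iterating over bytes yields integers; ord() gives a character's code point.
print(list(raw))                 # [223, 228]
print([ord(c) for c in text])    # [223, 228]

# In Latin-1 each code point 0-255 maps to the byte with the same value,
# so encoding the str with latin1 reproduces the original bytes.
print(raw == text.encode("latin1"))   # True
```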

Also:

# With the b prefix:
>>> b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
# Without the b prefix:
>>> 'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
'Gebietseinheit Kassel ohne Großstädte'

To print the string with normal-looking special characters, convert it to str with bytes.decode, using the latin1 encoding:

>>> bytes_str = b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
>>> bytes_str
b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
>>> normal_str = bytes_str.decode('latin1')
>>> normal_str
'Gebietseinheit Kassel ohne Großstädte'
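If a column ends up holding a mix of ordinary strings and undecoded bytes, the same decode step can be applied value by value. The sketch below uses a hypothetical helper, `to_text`, which is not part of geopandas; it assumes that UTF-8 should be tried first and Latin-1 used as the fallback, which may not match how your file was actually written.

```python
def to_text(value, encodings=("utf-8", "latin1")):
    """Return value as str, decoding bytes with the first encoding that succeeds."""
    if not isinstance(value, bytes):
        return value  # already a str (or some other type): leave it alone
    for enc in encodings:
        try:
            return value.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
    # Only reached if every listed encoding failed (latin1 itself never fails).
    return value.decode("latin1", errors="replace")


names = ["Göttingen", b"Gebietseinheit Kassel ohne Gro\xdfst\xe4dte"]
cleaned = [to_text(n) for n in names]
print(cleaned)
```

The byte \xdf followed by an ASCII letter is not valid UTF-8, so the first attempt raises UnicodeDecodeError and the Latin-1 fallback produces 'Großstädte'.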
