Postgres COPY FROM 文件抛出 unicode 错误，而引用的字符显然不在文件中

首先，感谢Stack Overflow上的每个人过去，现在和未来的帮助。你们都把我从灾难中拯救出来(无论是我自己的设计还是其他设计(，次数太多了，数不清了。

本期是我公司决定从Microsoft SQL Server 2005数据库过渡到PostgreSQL 9.4的一部分。我们一直在关注Postgres维基(https://wiki.postgresql.org/wiki/Microsoft_SQL_Server_to_PostgreSQL_Migration_by_Ian_Harding(上的注释，这些是我们针对相关表格遵循的步骤：

下载表数据 [在 Windows 客户端上]：

bcp "Carbon.consensus.observations" out "Carbon.consensus.observations" -k -S [servername] -T -w

复制到Postgres服务器 [运行CentOS 7]

在Postgres服务器上运行Python预处理脚本以更改编码并清理：

import sys
import os
import re
import codecs
import fileinput
base_path = '/tmp/tables/'
cleaned_path = '/tmp/tables_processed/'
files = os.listdir(base_path)
for filename in files:
    source_path = base_path + filename
    temp_path = '/tmp/' + filename
    target_path = cleaned_path + filename
    BLOCKSIZE = 1048576 # or some other, desired size in bytes
    with open(source_path, 'r') as source_file:
        with open(target_path, 'w') as target_file:
            start = True
            while True:
                contents = source_file.read(BLOCKSIZE).decode('utf-16le')
                if not contents:
                    break
                if start:
                    if contents.startswith(codecs.BOM_UTF8.decode('utf-8')):
                        contents = contents.replace(codecs.BOM_UTF8.decode('utf-8'), ur'')
                contents = contents.replace(ur'x80', u'')
                contents = re.sub(ur'00', ur'', contents)
                contents = re.sub(ur'rn', ur'n', contents)
                contents = re.sub(ur'r', ur'\r', contents)
                target_file.write(contents.encode('utf-8'))
                start = False
    for line in fileinput.input(target_path, inplace=1):
        if 'x80' in line:
            line = line.replace(r'x80', '')
        sys.stdout.write(line)

执行 SQL 以加载表：

COPY consensus.observations FROM '/tmp/tables_processed/Carbon.consensus.observations';

问题是 COPY 命令失败并出现 unicode 错误：

[2015-02-24 19:52:24] [22021] ERROR: invalid byte sequence for encoding "UTF8": 0x80
Where: COPY observations, line 2622420: "..."

鉴于这很可能是由于表中的数据错误(其中还包含合法的非 ASCII 字符(，我正在尝试在上下文中找到实际的字节序列，但我在任何地方都找不到它(sed 查看有问题的行，正则表达式替换字符作为预处理的一部分等(。作为参考，此 grep 不返回任何内容：

cat /tmp/tables_processed/Carbon.consensus.observations | grep --color='auto' -P "[x80]"

我在跟踪此字节序列在上下文中的位置时做错了什么？

我建议将SQL文件(似乎是/tmp/tables_processed/Carbon.consensus.observations(加载到具有十六进制模式的编辑器中。这应该允许您在上下文中查看它(取决于确切的编辑器(。

gVim(或基于终端的Vim(是我推荐的一个选项。

例如，如果我在 gVim 中打开一个包含以下内容的 SQL复制文件：

1   1.2
2   1.1
3   3.2

我可以通过命令%!xxd(在 gVim 或终端 Vim 中(或菜单选项工具转换为十六进制模式>转换为十六进制模式。

这将产生此显示：

0000000: 3109 312e 320a 3209 312e 310a 3309 332e  1.1.2.2.1.1.3.3.
0000010: 320a                                     2.

然后，您可以运行%!xxd -r将其转换回来，或者运行菜单选项工具>转换回来。

注意：这实际上修改了文件，因此建议对原始文件的副本执行此操作，以防更改以某种方式被写入(您必须在 Vim 中显式保存缓冲区(。

这样，您可以在左侧看到十六进制序列，在右侧看到它们的 ASCII 等效序列。如果搜索80，您应该能够在上下文中看到它。但是，对于gVim，两种模式的行号将不同，如本例所示。

不过，您找到的第一个80很可能是那条线，因为如果有更早的那条线，它很可能会在这些线上失败。

我过去使用的另一个可能有帮助的工具是图形十六进制编辑器 GHex。由于这是一个 GNOME 项目，不太确定它是否适用于 CentOS。wxHexEditor 据说可以与 CentOS 配合使用，从网站上看起来很有前途，尽管我还没有使用它。它被定位为"海量文件的十六进制编辑器"，所以如果你的SQL文件很大，这可能是要走的路。

相关内容

最新更新

热门标签：