Unpacking fixed-width Unicode file lines with special characters: Python UnicodeDecodeError

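I am trying to parse each line of a database file to get it ready for import. It has fixed-width lines, but the widths are in characters, not bytes. I have coded something based on Martineau's answer, but I am having trouble with the special characters.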

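Sometimes they break the expected width, and other times they just throw a UnicodeDecodeError. I believe the decode error could be fixed, but can I keep using struct.unpack and still decode the special characters correctly? I think the problem is that they are encoded in multiple bytes, which throws off the expected field widths; as I understand it, the field widths are counted in bytes, not characters.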

import os, csv
def ParseLine( arquivo):
    import struct, string   
    format = "1x 12s 1x 18s 1x 16s"
    expand = struct.Struct(format).unpack_from
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
    for line in arquivo:
        fields = unpack(line)
        yield [x.strip() for x in fields]
Caminho = r"C:Sample"
os.chdir(Caminho)
with open("Sample data.txt", 'r') as arq: 
    with open("Out" + ".csv", "w", newline ='') as sai: 
        Write = csv.writer(sai, delimiter= ";", quoting=csv.QUOTE_MINIMAL).writerows
        for line in ParseLine(arq): 
            Write([line])

Sample data:

|     field 1|      field 2     |     field 3    |
| sreaodrsa  | raesodaso t.thl o| .tdosadot. osa |
| resaodra   | rôn. 2x  17/220V | sreao.tttra v  |
| esarod sê  | raesodaso t.thl o| .tdosadot. osa |
| esarod sa í| raesodaso t.thl o| .tdosadot. osa |

Actual output:

field 1;field 2;field 3
sreaodrsa;raesodaso t.thl o;.tdosadot. osa
resaodra;rôn. 2x  17/22;V | sreao.tttra

In the output, lines 1 and 2 are as expected. Line 3 has the wrong widths, probably because of the multibyte ô. Line 4 raises the following exception:

Traceback (most recent call last):
  File "C:SampleFindSample.py", line 18, in <module>
    for line in ParseLine(arq):
  File "C:SampleFindSample.py", line 9, in ParseLine
    fields = unpack(line)
  File "C:SampleFindSample.py", line 7, in <lambda>
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
  File "C:SampleFindSample.py", line 7, in <genexpr>
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 11: unexpected end of data

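I need to perform specific operations on each field, so I can't run re.sub over the whole file as I did before. I would like to keep this code, since it seems efficient and is on the verge of working. That said, if there is a more efficient way to parse, I could give it a try. I need to keep the special characters.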

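Indeed, the struct approach falls down here because it expects fields to be a fixed number of bytes wide, while your format uses a fixed number of code points.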
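To make the mismatch concrete, here is a minimal sketch using the last sample row. It assumes the data is encoded as UTF-8, which is what the bare .encode() and .decode() calls in the question default to; the variable names are only illustrative.

```python
import struct

# Last sample row; "\u00ed" is the single code point 'í'.
line = "| esarod sa \u00ed| raesodaso t.thl o| .tdosadot. osa |"
encoded = line.encode("utf-8")

print(len(line))     # length in code points
print(len(encoded))  # length in bytes: one more, because 'í' needs two bytes

# struct counts bytes, so the 12-byte window for the first field stops
# after only the first of the two bytes that make up 'í'.
fields = struct.Struct("1x 12s 1x 18s 1x 16s").unpack_from(encoded)
first = fields[0]
print(first)

try:
    first.decode("utf-8")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

When the multibyte character sits earlier in a field, the window no longer splits it, so decoding succeeds, but every later field is shifted by the extra byte. That is the misaligned row seen in the output.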

I wouldn't use struct here at all. Your lines are already decoded to Unicode values; just use slicing to extract the data:

def ParseLine(arquivo):
    slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
    for line in arquivo:
        yield [line[s].strip() for s in slices]

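This deals entirely in characters in the already-decoded line, not bytes. If you have field widths rather than indices, you can generate the slice() objects as well: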

def widths_to_slices(widths):
    pos = 0
    for width in widths:
        pos += 1  # skip the one-character delimiter before each field
        yield slice(pos, pos + width)
        pos += width


def ParseLine(arquivo):
    widths = (12, 18, 16)
    slices = list(widths_to_slices(widths))  # build the slices once, not per line
    for line in arquivo:
        yield [line[s].strip() for s in slices]

Demo:

>>> sample = '''\
... |     field 1|      field 2     |     field 3    |
... | sreaodrsa  | raesodaso t.thl o| .tdosadot. osa |
... | resaodra   | rôn. 2x  17/220V | sreao.tttra v  |
... | esarod sê  | raesodaso t.thl o| .tdosadot. osa |
... | esarod sa í| raesodaso t.thl o| .tdosadot. osa |
... '''.splitlines()
>>> def ParseLine(arquivo):
...     slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
...     for line in arquivo:
...         yield [line[s].strip() for s in slices]
... 
>>> for line in ParseLine(sample):
...     print(line)
... 
['field 1', 'field 2', 'field 3']
['sreaodrsa', 'raesodaso t.thl o', '.tdosadot. osa']
['resaodra', 'rôn. 2x  17/220V', 'sreao.tttra v']
['esarod sê', 'raesodaso t.thl o', '.tdosadot. osa']
['esarod sa í', 'raesodaso t.thl o', '.tdosadot. osa']
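For completeness, here is a rough sketch of how the slicing approach could feed the CSV output from the question. The names parse_lines and convert are hypothetical, in-memory io.StringIO objects stand in for the real files, and the UTF-8 encoding mentioned in the comment is an assumption about the data.

```python
import csv
import io

# Field positions in characters, matching the slices used above.
SLICES = [slice(1, 13), slice(14, 32), slice(33, 49)]


def parse_lines(lines):
    """Yield the stripped fields of each fixed-width line."""
    for line in lines:
        yield [line[s].strip() for s in SLICES]


def convert(source, target):
    """Write the rows parsed from `source` to `target` as ';'-separated CSV."""
    writer = csv.writer(target, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    writer.writerows(parse_lines(source))


# In-memory stand-ins for the real files. With files on disk, pass an explicit
# encoding, e.g. open("Sample data.txt", encoding="utf-8") (assumed encoding),
# and open the output with newline="" as in the question.
source = io.StringIO(
    "| resaodra   | r\u00f4n. 2x  17/220V | sreao.tttra v  |\n"
    "| esarod sa \u00ed| raesodaso t.thl o| .tdosadot. osa |\n"
)
target = io.StringIO()
convert(source, target)
print(target.getvalue())
```

Any per-field processing can go inside parse_lines, where each field is already a decoded str. Passing the encoding explicitly to open() avoids depending on the platform default.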
