I'm trying to fix a null-byte problem in a CSV file. The csv_file object is passed in from another function in my Flask application:
stream = codecs.iterdecode(csv_file.stream, "utf-8-sig", errors="strict")
dict_reader = csv.DictReader(stream, skipinitialspace=True, restkey="INVALID")
for row in dict_reader:  # Error is thrown here
    ...
The error thrown in the console is _csv.Error: line contains NULL byte.
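So far I have tried:
- Different encodings (I checked the file's encoding, and it is utf-8-sig)
- Using .replace('\x00', '')
but I can't seem to get rid of the null bytes.
I'd like to remove the null bytes by replacing them with empty strings, but I would also be fine with skipping the rows that contain them. I can't share my CSV file.
Edit: the solution I arrived at: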
content = csv_file.read()
# Converting the above object into an in-memory byte stream
csv_stream = io.BytesIO(content)
# Iterate over the lines, replacing null bytes with an empty byte string
fixed_lines = (line.replace(b'\x00', b'') for line in csv_stream)
# Below remains unchanged, just passing in fixed_lines instead of csv_stream
stream = codecs.iterdecode(fixed_lines, 'utf-8-sig', errors='strict')
dict_reader = csv.DictReader(stream, skipinitialspace=True, restkey="INVALID")
I think your question really needs to show an example of the byte stream you expect to get from csv_file.stream.
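Since the file itself can't be shared, a small report of where the null bytes sit might be enough to include in the question. The following is only a sketch under that assumption: null_byte_report is a hypothetical helper name, and sample_stream is made-up data standing in for csv_file.stream.

```python
import io


def null_byte_report(byte_stream):
    """Return (line_number, null_count) for each line that contains a null byte."""
    return [
        (number, line.count(b"\x00"))
        for number, line in enumerate(byte_stream, start=1)
        if b"\x00" in line
    ]


# Made-up bytes; in the Flask app this would be csv_file.stream instead.
sample_stream = io.BytesIO(b"a,b\n1,2\x00\n3,\x004\x00\n")
report = null_byte_report(sample_stream)
print(report)  # [(2, 1), (3, 2)]
```

Iterating over the stream consumes it, so it would need a seek(0) (or a fresh copy of the bytes) before being handed to the CSV reader afterwards.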
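I like pushing myself to learn more about Python's IO, encoding/decoding, and CSV handling, so I've done a fair amount of the work myself here, though I wouldn't expect other people to do the same.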
import csv
from codecs import iterdecode
import io
# Flask's file.stream is probably BytesIO, see https://stackoverflow.com/a/18246385
# and the Gist in the comment, https://gist.github.com/lost-theory/3772472?permalink_comment_id=1983064#gistcomment-1983064
csv_bytes = b'''\xef\xbb\xbf C1, C2
r1c1, r1c2
r2c1, r2c2, r2c3\x00'''
# This is what Flask is probably giving you
csv_stream = io.BytesIO(csv_bytes)
# Fixed lines is another iterator, `(line.repl...)` vs. `[line.repl...]`
fixed_lines = (line.replace(b'\x00', b'') for line in csv_stream)
decoded_lines = iterdecode(fixed_lines, 'utf-8-sig', errors='strict')
reader = csv.DictReader(decoded_lines, skipinitialspace=True, restkey="INVALID")
for row in reader:
    print(row)
I get:
{'C1': 'r1c1', 'C2': 'r1c2'}
{'C1': 'r2c1', 'C2': 'r2c2', 'INVALID': ['r2c3']}
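The question also says that skipping the rows with null bytes would be acceptable. A minimal sketch of that alternative, using the same sample bytes as above: rows_without_null_lines is a hypothetical helper name, and the approach assumes every CSV record sits on a single physical line (a quoted field containing a newline would be split by this filter).

```python
import csv
import io
from codecs import iterdecode


def rows_without_null_lines(byte_stream):
    """Build a DictReader over the stream, dropping any line with a null byte."""
    # Generator expression: lines are filtered lazily, one at a time.
    clean_lines = (line for line in byte_stream if b"\x00" not in line)
    decoded_lines = iterdecode(clean_lines, "utf-8-sig", errors="strict")
    return csv.DictReader(decoded_lines, skipinitialspace=True, restkey="INVALID")


# Same made-up bytes as the example above: BOM, header, two data rows.
sample = io.BytesIO(b"\xef\xbb\xbf C1, C2\nr1c1, r1c2\nr2c1, r2c2, r2c3\x00")
skipped_rows = list(rows_without_null_lines(sample))
print(skipped_rows)  # [{'C1': 'r1c1', 'C2': 'r1c2'}]
```

Note that if the header line itself contained a null byte it would be dropped too, and the first data row would then be taken as the header, so replacing the bytes is probably the safer default.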