Pandas: read_csv() method: string length limit

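I have a file with millions of rows (26 GB). However, the second line contains 23,000,000 b'\x00' (NUL) characters. When I read the file into a DataFrame, read_csv fails to read past the second line, so I end up with only one row.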

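Is it possible to read all of the data with the read_csv method?
Is there any difference between pandas versions 1.3.5 and 1.1.4 in this respect? Surprisingly, version 1.1.4 reads the complete data.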

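As an example of what I suggested in my comment, you can preprocess the file in a single pass, reading it in chunks and stripping the NUL bytes: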

from functools import partial
import pandas as pd

# Open two files: your original and a new output file
with open("somehugefile.csv", 'rb') as infile, open('transformed.csv', 'wb') as outfile:
    # Iterate in 100000-byte chunks rather than line by line
    for chunk in iter(partial(infile.read, 100000), b''):
        # Write the chunk to the new file with the NUL bytes removed
        outfile.write(chunk.replace(b'\x00', b''))

# Then process your new file instead
df = pd.read_csv('transformed.csv')
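To sanity-check the approach before running it on the full 26 GB file, it may help to try it on a tiny synthetic file first. The following is only a sketch: the function name strip_nul_bytes, the sample file names, and the chunk sizes are made up for illustration. It also counts the bytes it drops, so you can compare that number against what you expect. Because NUL is a single byte, a chunk boundary can never split it, so chunked replacement should give the same result as processing the whole file at once.

```python
import os
import tempfile
from functools import partial


def strip_nul_bytes(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path without NUL bytes; return how many were dropped.

    Hypothetical helper for illustration, not part of pandas.
    """
    removed = 0
    with open(src_path, "rb") as infile, open(dst_path, "wb") as outfile:
        # Read fixed-size binary chunks until read() returns b"" at end of file
        for chunk in iter(partial(infile.read, chunk_size), b""):
            removed += chunk.count(b"\x00")
            outfile.write(chunk.replace(b"\x00", b""))
    return removed


# Tiny synthetic file standing in for the real one (assumed layout)
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "sample.csv")
dst = os.path.join(tmp_dir, "sample_clean.csv")
with open(src, "wb") as f:
    f.write(b"a,b\n1," + b"\x00" * 10 + b"2\n3,4\n")

# A deliberately small chunk size so the NUL run spans several chunks
removed_count = strip_nul_bytes(src, dst, chunk_size=4)
with open(dst, "rb") as f:
    cleaned = f.read()
```

If the count and the cleaned content look right on the sample, the same function can be pointed at the real file, and the output passed to pd.read_csv as above.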
