How to get pandas DataFrames chunk by chunk from CSV files inside a huge tar.gz, without extracting the archive



I have a huge compressed file, and I want to read the individual DataFrames in it piece by piece so that I don't run out of memory.

Also, because of time and disk space constraints, I can't extract the .tar.gz.

This is the code I have so far:

import io
# With this lib we can navigate a compressed archive
# without extracting its content to disk
import tarfile

import pandas as pd

tar_file = tarfile.open(r'\pathtothetarfile.tar.gz')

# Iterate over the CSV files contained in the compressed archive,
# yielding (member name, DataFrame) pairs
def generate_individual_df(tar_file):
    # The opening parenthesis must sit on the same line as `return`;
    # otherwise the function returns None
    return (
        (
            member.name,
            pd.read_csv(io.StringIO(tar_file.extractfile(member).read().decode('ascii')), header=None)
        )
        for member in tar_file
        if member.isreg()
    )

for filename, dataframe in generate_individual_df(tar_file):
    # But dataframe is the whole file, which is too big
    pass

I tried "How to create a pandas DataFrame from a CSV compressed in a tar.gz?", but I still couldn't solve it...

You can actually iterate over chunks inside the compressed file with the following function:

def generate_individual_df(tar_file, chunksize=10**4):
    # Yield (member name, chunk) pairs; each chunk is a DataFrame
    # with at most `chunksize` rows
    return (
        (
            member.name,
            chunk
        )
        for member in tar_file
        if member.isreg()
        for chunk in pd.read_csv(io.StringIO(tar_file.extractfile(member)
                .read().decode('ascii')), header=None, chunksize=chunksize)
    )
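
Note that the function above still calls .read() on each member, so one whole decompressed CSV is held in memory before it is split into chunks. If a single CSV is itself too large, a possible variant is to hand the file object returned by extractfile straight to pd.read_csv, which accepts file-like objects. This is only a minimal sketch under that assumption: the name iter_csv_chunks and its tar_path parameter are made up for illustration, and the 'ascii' encoding and header=None are carried over from the question.

```python
import tarfile

import pandas as pd


def iter_csv_chunks(tar_path, chunksize=10**4):
    """Yield (member name, DataFrame chunk) for every regular file in the archive.

    Hypothetical helper: assumes every regular member is an ASCII CSV
    without a header row, as in the question.
    """
    with tarfile.open(tar_path, mode="r:gz") as tar_file:
        for member in tar_file:
            if not member.isreg():
                continue
            # extractfile returns a binary file object that is read on demand,
            # so the member is never loaded into memory in one piece
            fileobj = tar_file.extractfile(member)
            # With chunksize set, read_csv returns an iterator of DataFrames
            reader = pd.read_csv(
                fileobj, header=None, encoding="ascii", chunksize=chunksize
            )
            for chunk in reader:
                yield member.name, chunk
```

Each chunk is an ordinary DataFrame with at most chunksize rows, so you can aggregate or filter it inside the loop and discard it before the next one is read. Because this is a generator, the archive stays open only while you are iterating over it.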
