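Pandas sparse DataFrame larger on disk than dense version

I find that the sparse version of a DataFrame is actually much larger than the dense version when saved to disk. What am I doing wrong?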

test = pd.DataFrame(ones((4,4000)))
test.ix[:,:] = nan
test.ix[0,0] = 47
test.to_hdf('test3', 'df')
test.to_sparse(fill_value=nan).to_hdf('test4', 'df')
test.to_pickle('test5')
test.to_sparse(fill_value=nan).to_pickle('test6')
....
ls -sh test*
200K test3   16M test4  164K test5  516K test6

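Using version 0.12.0.

I would ultimately like to efficiently store 10^7 x 60 arrays with a density of about 10%, then pull them into pandas DataFrames and play with them.

Edit: Thanks to Jeff for answering the original question. Follow-up question: this only seems to give savings for pickling, not for other formats such as HDF5. Is pickling my best route?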

print shape(array_activity) #This is just 0s and 1s
(1020000, 60)
test = pd.DataFrame(array_activity)
test_sparse = test.to_sparse()
print test_sparse.density
0.0832333496732
test.to_hdf('1', 'df')
test_sparse.to_hdf('2', 'df')
test.to_pickle('3')
test_sparse.to_pickle('4')
!ls -sh 1 2 3 4
477M 1  544M 2  477M 3   83M 4

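This is data that, stored as a list of indices in a MATLAB .mat file, takes less than 12M. I was eager to get it into HDF5/PyTables format so that I could grab just specific indices (other files are much larger and take much longer to load into memory) and then readily do pandas-y things to them. Perhaps I am not going about this the right way?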

You are creating a frame that has 4000 columns and only 4 rows; sparse is dealt with row-wise, so reverse the dimensions.

In [2]: from numpy import *
In [3]: test = pd.DataFrame(ones((4000,4)))
In [4]: test.ix[:,:] = nan
In [5]: test.ix[0,0] = 47
In [6]: test.to_hdf('test3', 'df')
In [7]: test.to_sparse(fill_value=nan).to_hdf('test4', 'df')
In [8]: test.to_pickle('test5')
In [9]: test.to_sparse(fill_value=nan).to_pickle('test6')
In [11]: !ls -sh test3 test4 test5 test6
164K test3  148K test4  160K test5   36K test6
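The same orientation effect can be checked on newer pandas releases, where `.ix` and `to_sparse` are no longer available. The following is a minimal sketch under that assumption: it uses `pd.SparseDtype` in place of `to_sparse`, measures pickled size in memory instead of writing files, and the helper names `make_frame` and `pickled_size` are made up for illustration. The absolute numbers will not match the 0.12 output above.

```python
import pickle

import numpy as np
import pandas as pd


def make_frame(shape):
    """All-NaN frame with a single real value, as in the question."""
    frame = pd.DataFrame(np.full(shape, np.nan))
    frame.iloc[0, 0] = 47
    return frame


def pickled_size(frame):
    """Size in bytes of the pickled frame (in memory, no file written)."""
    return len(pickle.dumps(frame, protocol=pickle.HIGHEST_PROTOCOL))


# NaN is the fill value, so only the single 47 is stored explicitly.
sparse_dtype = pd.SparseDtype("float64", fill_value=np.nan)

wide = make_frame((4, 4000))   # 4 rows, 4000 columns (the question)
tall = make_frame((4000, 4))   # 4000 rows, 4 columns (the answer)

sizes = {
    "wide dense": pickled_size(wide),
    "wide sparse": pickled_size(wide.astype(sparse_dtype)),
    "tall dense": pickled_size(tall),
    "tall sparse": pickled_size(tall.astype(sparse_dtype)),
}

for label, size in sizes.items():
    print(f"{label:12s} {size:>10d} bytes")
```

Each column carries its own sparse bookkeeping, so a frame with thousands of very short columns pays that overhead thousands of times, while a frame with a few long columns pays it only a few times.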

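Follow-up. The store you supplied was written in table format, and as a result it saved the dense version (sparse is not supported for table format, which is very flexible and queryable; see the docs).

Furthermore, you may want to experiment with saving your file using the 2 different representations of the sparse format.

So, here's a sample session: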

In [1]: df = pd.read_hdf('store_compressed.h5','test')
In [2]: type(df)
Out[2]: pandas.core.frame.DataFrame
In [3]: df.to_sparse(kind='block').to_hdf('test_block.h5','test',mode='w',complib='blosc',complevel=9)
In [4]: df.to_sparse(kind='integer').to_hdf('test_block.h5','test',mode='w',complib='blosc',complevel=9)
In [5]: df.to_sparse(kind='block').to_hdf('test_block.h5','test',mode='w',complib='blosc',complevel=9)
In [6]: df.to_sparse(kind='integer').to_hdf('test_integer.h5','test',mode='w',complib='blosc',complevel=9)
In [7]: df.to_hdf('test_dense_fixed.h5','test',mode='w',complib='blosc',complevel=9)
In [8]: df.to_hdf('test_dense_table.h5','test',mode='w',format='table',complib='blosc',complevel=9)
In [9]: !ls -ltr *.h5
-rwxrwxr-x 1 jreback users 57015522 Feb  6 18:19 store_compressed.h5
-rw-rw-r-- 1 jreback users 30335044 Feb  6 19:01 test_block.h5
-rw-rw-r-- 1 jreback users 28547220 Feb  6 19:02 test_integer.h5
-rw-rw-r-- 1 jreback users 44540381 Feb  6 19:02 test_dense_fixed.h5
-rw-rw-r-- 1 jreback users 57744418 Feb  6 19:03 test_dense_table.h5
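The `kind='block'` and `kind='integer'` arguments in the session above select how the positions of the stored values are indexed. As a rough illustration, assuming a recent pandas where this is exposed through `pd.arrays.SparseArray` and its `sp_index` attribute (the small input array is made up):

```python
import numpy as np
import pandas as pd

# Two runs of non-zero values: positions 2-4 and position 8.
values = np.array([0, 0, 1, 1, 1, 0, 0, 0, 1, 0], dtype="float64")

block = pd.arrays.SparseArray(values, fill_value=0, kind="block")
integer = pd.arrays.SparseArray(values, fill_value=0, kind="integer")

# Block index: one (start, length) pair per contiguous run.
block_starts = block.sp_index.blocs.tolist()
block_lengths = block.sp_index.blengths.tolist()

# Integer index: one entry per stored value.
positions = integer.sp_index.indices.tolist()

print("block starts:  ", block_starts)
print("block lengths: ", block_lengths)
print("int positions: ", positions)
```

With long contiguous runs the block index stays short; with scattered single values it needs a pair per value, and the integer index is usually the more compact of the two.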

IIRC there is a bug in 0.12 in that to_hdf doesn't pass all the arguments through, so you probably want to use:

with get_store('test.h5',mode='w',complib='blosc',complevel=9) as store:
    store.put('test',df)

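These are stored basically as a collection of SparseSeries, so if the density is low and the values are non-contiguous, the result will not be as small as it could be. The pandas sparse suite deals better with a smaller number of contiguous blocks, though YMMV. SciPy provides some sparse handling tools as well.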
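For the 0/1 activity array in the question, the SciPy route might look roughly like the sketch below. It assumes SciPy is installed and a recent pandas that provides `DataFrame.sparse.from_spmatrix`; the array here is random stand-in data (20,000 x 60 at about 8% density), not the asker's, and it is written to an in-memory buffer instead of a file.

```python
import io

import numpy as np
import pandas as pd
from scipy import sparse

# Stand-in for array_activity: mostly zeros, about 8% ones.
rng = np.random.default_rng(0)
dense = (rng.random((20_000, 60)) < 0.08).astype(np.float64)

# CSR keeps only the non-zero entries plus their column/row offsets.
matrix = sparse.csr_matrix(dense)

# Save to an in-memory buffer; a file path works the same way.
buffer = io.BytesIO()
sparse.save_npz(buffer, matrix, compressed=True)
sparse_bytes = len(buffer.getvalue())

# Load it back and hand it to pandas as sparse columns.
buffer.seek(0)
loaded = sparse.load_npz(buffer)
frame = pd.DataFrame.sparse.from_spmatrix(loaded)

print("dense bytes in memory:", dense.nbytes)
print("sparse .npz bytes:    ", sparse_bytes)
print("density:              ", frame.sparse.density)
```

This stores the whole matrix in one file, so it does not give the partial reads that HDF5 offers; it only shows that the on-disk size can track the number of non-zero entries.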

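Though IMHO, these are pretty trivial sizes for HDF5 files anyhow; you can handle a gigantic number of rows, and file sizes in the 10s and 100s of gigabytes can easily be handled (though recommended).

Furthermore, if you do need to query the data, you might consider using the table format.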
