Why does the Series use about 1.5 times as much memory as the DataFrame?



Here is the code:

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: from itertools import product
In [4]: index = list(map(''.join, product(*['ABCDEFGH']*4)))
In [5]: columns = list(map(''.join, product(*['xyzuvw']*3)))
In [6]: df = pd.DataFrame(np.random.randn(len(index), len(columns)), index=index, columns=columns)
In [7]: ser = df.stack()
In [8]: df.memory_usage().sum()
Out[8]: 7274496
In [10]: ser.memory_usage()
Out[10]: 10651360
In [11]: ser.memory_usage() / df.memory_usage().sum()
Out[11]: 1.4642059051238738
In [12]: df.to_hdf('f:/f1.h5', 'df')
In [13]: ser.to_hdf('f:/f2.h5', 'ser')
In [14]: import os
In [15]: os.stat('f:/f2.h5').st_size / os.stat('f:/f1.h5').st_size
Out[15]: 1.498167701758398

And the pandas version info:

pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1

Your Series is indexed by a MultiIndex, and that index takes up a lot of space. ser.reset_index(drop=True).memory_usage(deep=True) returns 7077968, which is essentially just the 884736 float64 values (884736 × 8 = 7,077,888 bytes) plus a tiny RangeIndex.
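
To see where the extra bytes go, you can compare the footprint of each index against the raw float data. This is a minimal sketch, assuming the session above; it only uses Index.memory_usage and ndarray.nbytes, and the exact numbers will vary with the pandas version (0.20.1 here):

# bytes taken by the stacked Series' MultiIndex (levels plus per-row codes)
idx_bytes = ser.index.memory_usage(deep=True)

# bytes taken by the float64 values alone (884736 * 8 = 7,077,888)
data_bytes = ser.values.nbytes

# the DataFrame stores its row labels and column labels only once each
df_label_bytes = (df.index.memory_usage(deep=True)
                  + df.columns.memory_usage(deep=True))

print(idx_bytes, data_bytes, df_label_bytes)

The MultiIndex keeps an integer code per row for each of its two levels, so its size grows with the number of cells (4096 × 216 rows after stacking), whereas the DataFrame's row and column labels are stored only once each. That difference accounts for the roughly 1.5× ratio both in memory and in the HDF5 files.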
