Increased memory usage when creating a pandas DataFrame

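I have a piece of code that receives callbacks from another function and builds a list of lists (pd_arr). This list is then used to create a DataFrame. Finally, the list of lists is deleted.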

Profiling it with memory_profiler gives this output:

102.632812 MiB   0.000000 MiB       init()
236.765625 MiB 134.132812 MiB           add_to_list()
return pd.DataFrame()
394.328125 MiB 157.562500 MiB       pd_df = pd.DataFrame(pd_arr, columns=df_columns)
350.121094 MiB -44.207031 MiB       pd_df = pd_df.set_index(df_columns[0])
350.292969 MiB   0.171875 MiB       pd_df.memory_usage()
350.328125 MiB   0.035156 MiB       print sys.getsizeof(pd_arr), sys.getsizeof(pd_arr[0]), sys.getsizeof(pd_df), len(pd_arr)
350.328125 MiB   0.000000 MiB       del pd_arr

Checking the deep memory usage of pd_df (the DataFrame) gives 80.5 MB. So my question is: why does memory not decrease after the del pd_arr line?

Also, the total DataFrame size according to the profiler (157 - 44 ≈ 113 MB) seems to be more than 80 MB. What causes the difference?

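Also, is there another memory-efficient way to create a DataFrame (from data received in a loop) whose time performance is not too bad (for example, an increase of around 10 seconds should be fine for a 100 MB DataFrame)?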

Edit: a simple Python script that demonstrates this behaviour

Filename: py_test.py
Line #    Mem usage    Increment   Line Contents
================================================
9    102.0 MiB      0.0 MiB   @profile
10                             def setup():
11                              global arr, size
12    102.0 MiB      0.0 MiB    arr = range(1, size)
13    131.2 MiB     29.1 MiB    arr = [x+1 for x in arr]

Filename: py_test.py
Line #    Mem usage    Increment   Line Contents
================================================
21    131.2 MiB      0.0 MiB   @profile
22                             def tearDown():
23                              global arr
24    131.2 MiB      0.0 MiB    del arr[:]
25    131.2 MiB      0.0 MiB    del arr
26     93.7 MiB    -37.4 MiB    gc.collect()

After introducing the DataFrame:

Filename: py_test.py
Line #    Mem usage    Increment   Line Contents
================================================
9    102.0 MiB      0.0 MiB   @profile
10                             def setup():
11                              global arr, size
12    102.0 MiB      0.0 MiB    arr = range(1, size)
13    132.7 MiB     30.7 MiB    arr = [x+1 for x in arr]

Filename: py_test.py
Line #    Mem usage    Increment   Line Contents
================================================
15    132.7 MiB      0.0 MiB   @profile
16                             def dfCreate():
17                              global arr
18    147.1 MiB     14.4 MiB    pd_df = pd.DataFrame(arr)
19    147.1 MiB      0.0 MiB    return pd_df

Filename: py_test.py
Line #    Mem usage    Increment   Line Contents
================================================
21    147.1 MiB      0.0 MiB   @profile
22                             def tearDown():
23                              global arr
24                              #del arr[:]
25    147.1 MiB      0.0 MiB    del arr
26    147.1 MiB      0.0 MiB    gc.collect()

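To answer your first question: when you try to free memory with del pd_arr, nothing is actually released, because the DataFrame keeps one reference to pd_arr and the top-level scope keeps another; decrementing the reference counter does not collect the memory while it is still in use.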

You can check my assumption by running sys.getrefcount(pd_arr) before del pd_arr; you will get 2 as the result.
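The reference-count check can also be tried in isolation. The following is a minimal sketch in Python 3 syntax (the profiles in this post come from Python 2), and the names data and alias are purely illustrative. Keep in mind that the value reported by sys.getrefcount generally includes the temporary reference created by passing the object as an argument, so comparing counts before and after a change tends to be more informative than reading a single value.

```python
import sys

data = [[x for x in range(10)] for _ in range(10)]

# Baseline count; it generally includes the temporary reference
# held by the getrefcount() call itself.
base = sys.getrefcount(data)

alias = data                         # a second name now refers to the same list
with_alias = sys.getrefcount(data)   # one more than the baseline

del alias                            # dropping the extra name restores the baseline
after_del = sys.getrefcount(data)

print(base, with_alias, after_del)
```

Running the same comparison around pd.DataFrame(pd_arr) is one way to see whether the DataFrame still holds the list after construction.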

Now, I believe the following snippet does the same thing you are trying to do: https://gist.github.com/vladignatyev/ec7a26b7042efd6f710d436afbfb87de/90df8cc6bbb8bd0cb3a1d2670e03aff24f3a5b24

If you try this snippet, you will see memory usage like the following:

Line #    Mem usage    Increment   Line Contents
================================================
13   63.902 MiB    0.000 MiB   @profile
14                             def to_profile():
15  324.828 MiB  260.926 MiB       pd_arr = make_list()
16                                 # pd_df = pd.DataFrame.from_records(pd_arr, columns=[x for x in range(0,1000)])
17  479.094 MiB  154.266 MiB       pd_df = pd.DataFrame(pd_arr)
18                                 # pd_df.info(memory_usage='deep')
19  479.094 MiB    0.000 MiB       print sys.getsizeof(pd_arr), sys.getsizeof(pd_arr[0])
20  481.055 MiB    1.961 MiB       print sys.getsizeof(pd_df), len(pd_arr)
21  481.055 MiB    0.000 MiB       print sys.getrefcount(pd_arr)
22  417.090 MiB  -63.965 MiB       del pd_arr
23  323.090 MiB  -94.000 MiB       gc.collect()

Try this example:

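# The decorator is injected automatically when the script is run with
# "python -m memory_profiler"; the import lets it run standalone as well.
from memory_profiler import profile

@profile
def test():
    a = [x for x in range(0,100000)]
    del a

aa = test()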

You will get what you expect:

Line #    Mem usage    Increment   Line Contents
================================================
6   64.117 MiB    0.000 MiB   @profile
7                             def test():
8   65.270 MiB    1.152 MiB       a = [x for x in range(0,100000)]
9                                 # print sys.getrefcount(a)
10   64.133 MiB   -1.137 MiB       del a
11   64.133 MiB    0.000 MiB       gc.collect()

Also, if you call sys.getrefcount(a), memory is sometimes freed before del a:

Line #    Mem usage    Increment   Line Contents
================================================
6   63.828 MiB    0.000 MiB   @profile
7                             def test():
8   65.297 MiB    1.469 MiB       a = [x for x in range(0,100000)]
9   64.230 MiB   -1.066 MiB       print sys.getrefcount(a)
10   64.160 MiB   -0.070 MiB       del a
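The list-only case can be cross-checked without memory_profiler. The sketch below uses the standard-library tracemalloc module and Python 3 syntax, which is an assumption on my part since the profiles above come from Python 2. tracemalloc counts bytes requested through Python's allocators, not the resident size of the process, so its figures will not match the MiB columns above. Process-level numbers additionally depend on whether the allocator hands freed memory back to the operating system.

```python
import gc
import tracemalloc

tracemalloc.start()

a = [x for x in range(0, 100000)]
# get_traced_memory() returns (current, peak) in bytes.
after_alloc, _ = tracemalloc.get_traced_memory()

del a
gc.collect()
after_del, _ = tracemalloc.get_traced_memory()

tracemalloc.stop()

print("after allocation: %d bytes" % after_alloc)
print("after del:        %d bytes" % after_del)
```

The second figure should be clearly lower than the first, which mirrors the negative increment on the del a line in the profile.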

But things get crazy once you use pandas.

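If you open the source code of pandas.DataFrame, you will see that when a DataFrame is initialized from a list, pandas creates a new NumPy array and copies the list's contents into it. Take a look: https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L329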

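Deleting pd_arr does not free memory, because pd_arr would be collected anyway once the DataFrame has been created and the function exits, since it has no additional references. The getrefcount calls before and after prove this.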

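Creating a new DataFrame from a plain list copies your list into a NumPy array. (See np.array(data, dtype=dtype, copy=copy) and the corresponding documentation for array.) The copy may affect execution time, because allocating a new block of memory is a heavy operation.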
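The copy described above can be observed with NumPy alone. This is a small sketch in Python 3 syntax; rows, arr, same and copied are illustrative names, not taken from the original code. It shows that np.array builds an independent buffer from a list, while passing an existing ndarray through np.asarray allocates nothing new.

```python
import numpy as np

rows = [[1, 2, 3], [4, 5, 6]]

arr = np.array(rows)      # builds a new buffer from the list's contents
rows[0][0] = 99           # changing the list afterwards...
print(arr[0, 0])          # ...does not change the array

same = np.asarray(arr)    # already an ndarray: the same object is returned
copied = np.array(arr, copy=True)   # explicit copy into a fresh buffer

print(same is arr)
print(np.shares_memory(arr, same), np.shares_memory(arr, copied))
```

The list and the array hold separate copies of the data, which is consistent with the extra memory on the pd.DataFrame(pd_arr) line when pd_arr is a list.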

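I tried initializing a new DataFrame from a NumPy array. The only difference is that the memory overhead shows up at the np.array call instead. Compare the following two snippets: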

import numpy as np

def make_list():  # 1
    pd_arr = []
    for i in range(0,10000):
        pd_arr.append([x for x in range(0,1000)])
    # Convert to a NumPy array here, so the copy happens inside this function.
    return np.array(pd_arr)

and

def make_list():  # 2
    pd_arr = []
    for i in range(0,10000):
        pd_arr.append([x for x in range(0,1000)])
    return pd_arr

Snippet #1 (creating the DataFrame adds no memory overhead!):

Line #    Mem usage    Increment   Line Contents
================================================
14   63.672 MiB    0.000 MiB   @profile
15                             def to_profile():
16  385.309 MiB  321.637 MiB       pd_arr = make_list()
17  385.309 MiB    0.000 MiB       print sys.getrefcount(pd_arr)
18  385.316 MiB    0.008 MiB       pd_df = pd.DataFrame(pd_arr)
19  385.316 MiB    0.000 MiB       print sys.getsizeof(pd_arr), sys.getsizeof(pd_arr[0])
20  386.934 MiB    1.617 MiB       print sys.getsizeof(pd_df), len(pd_arr)
21  386.934 MiB    0.000 MiB       print sys.getrefcount(pd_arr)
22  386.934 MiB    0.000 MiB       del pd_arr
23  305.934 MiB  -81.000 MiB       gc.collect()

Snippet #2 (more than 100 MB of overhead from copying the array!):

Line #    Mem usage    Increment   Line Contents
================================================
14   63.652 MiB    0.000 MiB   @profile
15                             def to_profile():
16  325.352 MiB  261.699 MiB       pd_arr = make_list()
17  325.352 MiB    0.000 MiB       print sys.getrefcount(pd_arr)
18  479.633 MiB  154.281 MiB       pd_df = pd.DataFrame(pd_arr)
19  479.633 MiB    0.000 MiB       print sys.getsizeof(pd_arr), sys.getsizeof(pd_arr[0])
20  481.602 MiB    1.969 MiB       print sys.getsizeof(pd_df), len(pd_arr)
21  481.602 MiB    0.000 MiB       print sys.getrefcount(pd_arr)
22  417.621 MiB  -63.980 MiB       del pd_arr
23  330.621 MiB  -87.000 MiB       gc.collect()

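So, initialize the DataFrame only from a NumPy array, not from a list. It is better in terms of memory consumption, and probably faster too, because it requires no additional memory allocation calls.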
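For data that arrives in a loop, one way to apply this advice is to fill a preallocated NumPy array and skip the intermediate list of lists. The following is only a sketch under stated assumptions: the number of rows and columns is known in advance, all values share one dtype, and on_row is a hypothetical stand-in for the real callback. It uses Python 3 syntax. copy=False is a documented DataFrame constructor argument that asks pandas not to copy the input; whether the buffer is actually reused can depend on the pandas version and dtype.

```python
import numpy as np
import pandas as pd

n_rows, n_cols = 1000, 5    # hypothetical sizes, assumed known up front
buf = np.empty((n_rows, n_cols), dtype=np.int64)

def on_row(i, values):
    """Hypothetical callback: store one received row in the preallocated buffer."""
    buf[i, :] = values

# Stand-in for the real data source that invokes the callback.
for i in range(n_rows):
    on_row(i, range(i, i + n_cols))

df_columns = ["c%d" % j for j in range(n_cols)]
pd_df = pd.DataFrame(buf, columns=df_columns, copy=False)

print(pd_df.shape)
print(pd_df.memory_usage(deep=True).sum())   # bytes, including the index
```

If the row count is not known ahead of time, the buffer would have to be grown in chunks, which brings some copying back.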

Hope I have now answered all your questions.
