Pandas/Sklearn gives incorrect MemoryError



I'm working in a Jupyter notebook with Python 2.7 (Anaconda 4.0) on an EC2 instance with plenty of memory (60 GB according to free, 48 GB of it available). I've loaded a Pandas (v0.18) dataframe that is large (150K rows, ~30 KB per row) but nowhere near the instance's memory capacity, even after making several copies. Yet certain Pandas and Scikit-learn (v0.17) calls trigger a MemoryError instantly, e.g.:

from sklearn import decomposition

# X is a subset of the original df, with 60 of its 3000 columns
# Y is a float column with the same index
X.add(Y)

# And for sklearn...
pca = decomposition.KernelPCA(n_components=5)
pca.fit(X, Y)
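
For reference, the frame's real footprint can be sanity-checked directly (df standing in for the loaded frame; this check is an aside, not one of the failing calls):

# Total in-memory size in GB, counting object columns at their true size
print("%.2f GB" % (df.memory_usage(deep=True).sum() / 1e9))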

Meanwhile, these work fine:

Z = X.copy(deep=True)
pca = decomposition.PCA(n_components=5)

Most puzzling of all, I can do this, and it completes in seconds:

huge = range(1000000000)

I've restarted the notebook, the kernel, and the whole instance, but the same calls keep raising MemoryError. I've also verified that I'm running 64-bit Python. Any suggestions?
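
For the record, one quick way to confirm the interpreter's word size (a generic check, not specific to this setup):

import struct
print("%d-bit" % (struct.calcsize("P") * 8))  # a 64-bit build prints "64-bit"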

Update: adding the tracebacks:

Traceback (most recent call last):
  File "<ipython-input-9-ae71777140e2>", line 2, in <module>
    Z = X.add(Y)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/ops.py", line 1057, in f
    return self._combine_series(other, na_op, fill_value, axis, level)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 3500, in _combine_series
    fill_value=fill_value)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 3528, in _combine_match_columns
    copy=False)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 2730, in align
    broadcast_axis=broadcast_axis)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 4152, in align
    fill_axis=fill_axis)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 4234, in _align_series
    fdata = fdata.reindex_indexer(join_index, lidx, axis=0)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3528, in reindex_indexer
    fill_tuple=(fill_value,))
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3591, in _slice_take_blocks_ax0
    fill_value=fill_value))
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3621, in _make_na_block
    block_values = np.empty(block_shape, dtype=dtype)
MemoryError

Traceback (most recent call last):
  File "<ipython-input-13-d510bc16443e>", line 3, in <module>
    pca.fit(X,Y)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/kernel_pca.py", line 202, in fit
    K = self._get_kernel(X)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/kernel_pca.py", line 135, in _get_kernel
    filter_params=True, **params)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 1347, in pairwise_kernels
    return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 1054, in _parallel_pairwise
    return func(X, Y, **kwds)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 716, in linear_kernel
    return safe_sparse_dot(X, Y.T, dense_output=True)
  File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 184, in safe_sparse_dot
    return fast_dot(a, b)
MemoryError

Figured it out

The problem was on the Pandas side. I had a DataFrame and a Series with matching indexes, X and Y. I figured I could add Y as another column like this:

X.add(Y)

But doing it that way tries to match Y on the columns rather than on the index, creating a 150K x 150K array. I needed to supply the axis:

X.add(Y, axis='index')
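
This is easy to demonstrate on toy data. A minimal sketch (hypothetical df and s, not the original frames) showing both alignment behaviors:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((4, 3)), columns=list('abc'))
s = pd.Series([10, 20, 30, 40])  # index 0..3 matches df's row index

# Default: aligns s's index against df's *columns* ('a','b','c' vs 0..3).
# Nothing overlaps, so the union yields a 4 x 7 frame of all NaN -- at
# 150K rows that union becomes a 150K x 150K allocation.
print(df.add(s).shape)                # (4, 7)

# axis='index': aligns s against the row index and adds 10..40 row-wise.
print(df.add(s, axis='index').shape)  # (4, 3)

The column-matching default is what makes idioms like df - df.mean() work; here it silently built the giant union instead.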
