numpy.std on a memmapped ndarray fails with MemoryError



I memory-mapped a huge (30 GB) ndarray:

arr = numpy.memmap(afile, dtype=numpy.float32, mode="w+", shape=(n, n,))

After filling it with some values (which works fine; peak memory usage stays below 1 GB), I want to compute the standard deviation:

print('stdev: {0:4.4f}\n'.format(numpy.std(arr)))

This line fails hard with a MemoryError.

I have no idea why it fails. I would be grateful for any tips on how to compute this in a memory-efficient way.

Environment: venv + Python 3.6.2 + NumPy 1.13.1

Indeed, numpy's std and mean implementations make a full copy of the array and are horribly memory-inefficient. Here is a better, block-wise implementation:

import numpy as np

# Memory overhead is BLOCKSIZE * itemsize. Should be at least ~1MB
# for efficient HDD access.
BLOCKSIZE = 1024**2
# Work on a flat 1-D view; reshaping a contiguous memmap does not copy data.
flat = arr.reshape(-1)
# For numerical stability. The closer this is to mean(arr), the better.
PIVOT = flat[0]

n = len(flat)
sum_ = 0.
sum_sq = 0.
for block_start in range(0, n, BLOCKSIZE):
    # Subtracting PIVOT creates a new in-memory block and leaves the
    # memmapped data on disk untouched.
    block_data = flat[block_start:block_start + BLOCKSIZE] - PIVOT
    sum_ += np.sum(block_data)
    sum_sq += np.sum(block_data**2)
stdev = np.sqrt(sum_sq / n - (sum_ / n)**2)
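
The comment above says that the closer PIVOT is to mean(arr), the better. A minimal sketch of an optional first pass that computes the exact mean block-wise with the same memory overhead (it reuses flat, n, and BLOCKSIZE from above); the result can then be used as the pivot:

# Optional first pass: block-wise mean, same O(BLOCKSIZE) memory overhead.
# Accumulating in float64 keeps the running total accurate for float32 data.
total = 0.
for block_start in range(0, n, BLOCKSIZE):
    total += np.sum(flat[block_start:block_start + BLOCKSIZE], dtype=np.float64)
mean = total / n
# 'mean' can now replace flat[0] as the PIVOT in the loop above; note that the
# loop above already recovers the mean anyway as PIVOT + sum_ / n.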
Latest update: the same loop, but accumulating the per-block sums with math.fsum for better numerical accuracy:

import math
import numpy as np

BLOCKSIZE = 1024**2
flat = arr.reshape(-1)
# For numerical stability. The closer this is to mean(arr), the better.
PIVOT = flat[0]

n = len(flat)
sum_ = 0.
sum_sq = 0.
for block_start in range(0, n, BLOCKSIZE):
    block_data = flat[block_start:block_start + BLOCKSIZE] - PIVOT
    sum_ += math.fsum(block_data)
    sum_sq += math.fsum(block_data**2)
stdev = np.sqrt(sum_sq / n - (sum_ / n)**2)
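
For a quick sanity check, either variant can be wrapped in a small function (the name blockwise_std below is just illustrative) and compared against numpy.std on an array that still fits in memory:

import math
import numpy as np

def blockwise_std(arr, blocksize=1024**2):
    # Block-wise standard deviation; memory overhead is ~blocksize * itemsize.
    flat = arr.reshape(-1)              # view, no copy, for contiguous arrays
    n = flat.size
    pivot = float(flat[0])              # shift for numerical stability
    sum_ = 0.
    sum_sq = 0.
    for start in range(0, n, blocksize):
        block = flat[start:start + blocksize] - pivot
        sum_ += math.fsum(block)
        sum_sq += math.fsum(block**2)
    return np.sqrt(sum_sq / n - (sum_ / n)**2)

small = np.random.rand(1000, 1000).astype(np.float32)
print(blockwise_std(small))   # should closely match...
print(np.std(small))          # ...the full in-memory result

Calling it on the 30 GB memmap works the same way; only one block-sized temporary is resident in memory at a time.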