Performance degradation after vectorization in NumPy

I am writing a time-consuming program. To reduce the running time, I have tried my best to use numpy.dot instead of for loops.

However, I found that the vectorized program performs much worse than the for loop version:

import numpy as np
import datetime
kpt_list = np.zeros((10000,20),dtype='float')
rpt_list = np.zeros((1000,20),dtype='float')
h_r = np.zeros((20,20,1000),dtype='complex')
r_ndegen = np.zeros(1000,dtype='float')
r_ndegen.fill(1)
# setup completed
# this is the vectorized version
r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
start = datetime.datetime.now()
phase = np.exp(1j * np.dot(rpt_list, kpt_list.T))/r_ndegen_tile
kpt_data_1 = h_r.dot(phase)
end = datetime.datetime.now()
print((end-start).total_seconds())
# the result is 19.302483
# this is the for loop version
kpt_data_2 = np.zeros((20, 20, 10000), dtype='complex')
start = datetime.datetime.now()
for i in range(10000):
    kpt = kpt_list[i, :]
    phase = np.exp(1j * np.dot(kpt, rpt_list.T))/r_ndegen
    kpt_data_2[:, :, i] = h_r.dot(phase)
end = datetime.datetime.now()
print((end-start).total_seconds())
# the result is 7.74583

What is going on?

The first thing I suggest you do is break your script down into separate functions, to make profiling and debugging easier:

def setup(n1=10000, n2=1000, n3=20, seed=None):
    gen = np.random.RandomState(seed)
    kpt_list = gen.randn(n1, n3).astype(np.float64)
    rpt_list = gen.randn(n2, n3).astype(np.float64)
    h_r = (gen.randn(n3, n3, n2) + 1j*gen.randn(n3, n3, n2)).astype(np.complex128)
    r_ndegen = gen.randn(n2).astype(np.float64)
    return kpt_list, rpt_list, h_r, r_ndegen

def original_vec(*args, **kwargs):
    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    kpt_data = h_r.dot(phase)
    return kpt_data

def original_loop(*args, **kwargs):
    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    kpt_data = np.zeros((20, 20, 10000), dtype='complex')
    for i in range(10000):
        kpt = kpt_list[i, :]
        phase = np.exp(1j * np.dot(kpt, rpt_list.T)) / r_ndegen
        kpt_data[:, :, i] = h_r.dot(phase)
    return kpt_data

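I would also strongly recommend using random data rather than all-zero or all-one arrays, unless that is what your actual data looks like (!). This makes it much easier to check the correctness of your code: for example, if your last step is to multiply by a matrix of zeros, your output will always be all zeros, regardless of whether there is a bug earlier in the code.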
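To illustrate the point about test data, here is a minimal sketch. The functions `correct` and `buggy` are hypothetical names I made up for this example, and the bug (conjugating the phase) is deliberate. With all-zero input the two versions agree, so the bug goes unnoticed; with random input the comparison catches it.

```python
import numpy as np

def correct(h, p):
    # contract the last axis of h with the first axis of p
    return h.reshape(-1, p.shape[0]).dot(p)

def buggy(h, p):
    # deliberate bug for illustration: the phase is conjugated by mistake
    return h.reshape(-1, p.shape[0]).dot(np.conj(p))

rng = np.random.RandomState(0)
p = rng.randn(6, 5) + 1j * rng.randn(6, 5)

h_zeros = np.zeros((4, 4, 6), dtype=complex)
h_rand = rng.randn(4, 4, 6) + 1j * rng.randn(4, 4, 6)

# all-zero input: both versions return zeros, so the bug is hidden
zeros_agree = np.allclose(correct(h_zeros, p), buggy(h_zeros, p))
# random input: the two versions disagree, so the bug is exposed
random_agree = np.allclose(correct(h_rand, p), buggy(h_rand, p))

print(zeros_agree, random_agree)
```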

Next, I would run these functions through line_profiler to see where they spend most of their time. In particular, for original_vec:

In [1]: %lprun -f original_vec original_vec()
Timer unit: 1e-06 s
Total time: 23.7598 s
File: <ipython-input-24-c57463f84aad>
Function: original_vec at line 12
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    12                                           def original_vec(*args, **kwargs):
    13                                           
    14         1        86498  86498.0      0.4      kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    15                                           
    16         1        69700  69700.0      0.3      r_ndegen_tile = np.tile(r_ndegen.reshape(1000, 1), 10000)
    17         1      1331947 1331947.0      5.6      phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen_tile
    18         1     22271637 22271637.0     93.7      kpt_data = h_r.dot(phase)
    19                                           
    20         1            4      4.0      0.0      return kpt_data

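You can see that it spends 93% of its time computing the dot product between h_r and phase. Here, h_r is a (20, 20, 1000) array and phase is (1000, 10000). We are computing a sum product over the last dimension of h_r and the first dimension of phase (in einsum notation you could write this as ijk,kl->ijl).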
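As a quick sanity check of that einsum signature, the sketch below compares np.dot with np.einsum on small random arrays. The shapes are scaled down from the real problem purely so that it runs instantly; they are my choice, not part of the original code.

```python
import numpy as np

rng = np.random.RandomState(0)
# small stand-ins for the (20, 20, 1000) and (1000, 10000) arrays
h_r = rng.randn(3, 4, 5) + 1j * rng.randn(3, 4, 5)
phase = rng.randn(5, 7) + 1j * rng.randn(5, 7)

via_dot = h_r.dot(phase)
via_einsum = np.einsum('ijk,kl->ijl', h_r, phase)

print(via_dot.shape)                     # (3, 4, 7)
print(np.allclose(via_dot, via_einsum))  # True
```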

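The first two dimensions of h_r don't really matter here: we could just as easily reshape h_r into a (20*20, 1000) array before taking the dot product. It turns out that this reshaping operation by itself gives a huge performance improvement: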

In [2]: %timeit h_r.dot(phase)
1 loop, best of 3: 22.6 s per loop
In [3]: %timeit h_r.reshape(-1, 1000).dot(phase)
1 loop, best of 3: 1.04 s per loop

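I'm not entirely sure why this is the case. I would have hoped that numpy's dot function would be smart enough to apply this simple optimization automatically. On my laptop the second case seems to use multiple threads whereas the first one doesn't, which suggests that the first case might not be calling multithreaded BLAS routines.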

Here is a vectorized version that incorporates the reshaping operation:

def new_vec(*args, **kwargs):
    kpt_list, rpt_list, h_r, r_ndegen = setup(*args, **kwargs)
    phase = np.exp(1j * np.dot(rpt_list, kpt_list.T)) / r_ndegen[:, None]
    kpt_data = h_r.reshape(-1, phase.shape[0]).dot(phase)
    return kpt_data.reshape(h_r.shape[:2] + (-1,))

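The -1 indices tell numpy to infer the size of those dimensions from the other dimensions and the number of elements in the array. I have also used broadcasting to divide by r_ndegen, which eliminates the need for np.tile.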
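The two tricks in new_vec can be checked in isolation. This is only a small sketch with made-up shapes: it confirms that dividing by r_ndegen[:, None] gives the same result as dividing by the tiled array, and shows what shape reshape(-1, n) infers.

```python
import numpy as np

rng = np.random.RandomState(0)
r_ndegen = rng.rand(6) + 1.0   # strictly positive, so the division is safe
x = rng.randn(6, 4)

# explicit tiling, as in the original code: shape (6, 4)
tiled = np.tile(r_ndegen.reshape(6, 1), 4)
# broadcasting: r_ndegen[:, None] has shape (6, 1) and is stretched along axis 1
same = np.allclose(x / tiled, x / r_ndegen[:, None])

# -1 lets numpy work out the leading dimension: 2*3*6 elements / 6 columns = 6 rows
h = rng.randn(2, 3, 6)
flat = h.reshape(-1, 6)

print(tiled.shape, same, flat.shape)  # (6, 4) True (6, 6)
```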

By using the same random input data, we can check that the new version gives the same result as the original:

In [4]: ans1 = original_loop(seed=0)
In [5]: ans2 = new_vec(seed=0)    
In [6]: np.allclose(ans1, ans2)
Out[6]: True

Some performance benchmarks:

In [7]: %timeit original_loop()
1 loop, best of 3: 13.5 s per loop
In [8]: %timeit original_vec()
1 loop, best of 3: 24.1 s per loop
In [5]: %timeit new_vec()
1 loop, best of 3: 2.49 s per loop
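If you are not working in IPython, a rough equivalent of %timeit can be put together with the standard timeit module. The sketch below uses small made-up shapes and hypothetical function names (nd_dot, reshaped_dot), so the numbers it prints say nothing about the full-size problem; it only shows the measurement pattern.

```python
import timeit
import numpy as np

rng = np.random.RandomState(0)
h_r = rng.randn(8, 8, 50) + 1j * rng.randn(8, 8, 50)
phase = rng.randn(50, 200) + 1j * rng.randn(50, 200)

def nd_dot():
    # dot product taken directly on the 3-D array
    return h_r.dot(phase)

def reshaped_dot():
    # flatten the leading axes, multiply as 2-D, then restore the shape
    out = h_r.reshape(-1, h_r.shape[-1]).dot(phase)
    return out.reshape(h_r.shape[:2] + (-1,))

# take the best of several repeats, as %timeit does
timings = {
    f.__name__: min(timeit.repeat(f, number=10, repeat=3))
    for f in (nd_dot, reshaped_dot)
}
for name, seconds in timings.items():
    print("%s: %.6f s per 10 calls" % (name, seconds))
```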

Update:

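I was curious why np.dot was so much slower for the original (20, 20, 1000) h_r array, so I dug into the numpy source code. The logic implemented in multiarraymodule.c turns out to be very simple: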

#if defined(HAVE_CBLAS)
    if (PyArray_NDIM(ap1) <= 2 && PyArray_NDIM(ap2) <= 2 &&
            (NPY_DOUBLE == typenum || NPY_CDOUBLE == typenum ||
             NPY_FLOAT == typenum || NPY_CFLOAT == typenum)) {
        return cblas_matrixproduct(typenum, ap1, ap2, out);
    }
#endif

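In other words, numpy just checks whether either of the input arrays has more than 2 dimensions, and if so immediately falls back on a non-BLAS implementation of matrix-matrix multiplication. It seems like it shouldn't be too difficult to check whether the inner dimensions of the two arrays are compatible and, if so, treat them as 2D and perform *gemm matrix multiplication on them. In fact, there is an open feature request for this dating back to 2012, if any numpy devs are reading...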

In the meantime, it's a nice performance trick to be aware of when multiplying tensors.
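For what it's worth, the "treat them as 2D" idea can be sketched in pure Python. The helper dot_via_2d below is a hypothetical name of mine, not a numpy function, and it assumes both inputs have at least 2 dimensions. It follows np.dot's documented convention of contracting the last axis of a with the second-to-last axis of b, but routes the work through a single 2D product.

```python
import numpy as np

def dot_via_2d(a, b):
    """Hypothetical helper: same result as np.dot(a, b) for inputs with
    ndim >= 2, computed through one 2-D matrix product."""
    k = a.shape[-1]
    if b.shape[-2] != k:
        raise ValueError("inner dimensions do not match")
    # collapse all leading axes of a into one
    a2 = a.reshape(-1, k)
    # np.dot contracts the second-to-last axis of b, so move it to the front
    b_moved = np.moveaxis(b, -2, 0)
    b2 = b_moved.reshape(k, -1)
    out = a2.dot(b2)  # both operands are 2-D here
    # restore the shape np.dot would have produced
    return out.reshape(a.shape[:-1] + b_moved.shape[1:])

rng = np.random.RandomState(0)
a = rng.randn(3, 4, 5)
b = rng.randn(2, 5, 6)
print(dot_via_2d(a, b).shape)                      # (3, 4, 2, 6)
print(np.allclose(dot_via_2d(a, b), np.dot(a, b)))  # True
```

Note that moving the axis of b may force a copy when it is reshaped, so whether this pays off for a given pair of arrays is something to measure rather than assume.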

Update 2:

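I forgot about np.tensordot. Since it calls the same underlying BLAS routines as np.dot on a 2D array, it can achieve the same performance bump, but without all those ugly reshape operations: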

In [6]: %timeit np.tensordot(h_r, phase, axes=1)
1 loop, best of 3: 1.05 s per loop
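A small check, again with scaled-down shapes of my own choosing, that np.tensordot with axes=1 produces the same array as the reshape, dot, reshape sequence:

```python
import numpy as np

rng = np.random.RandomState(0)
h_r = rng.randn(3, 4, 5) + 1j * rng.randn(3, 4, 5)
phase = rng.randn(5, 7) + 1j * rng.randn(5, 7)

# axes=1 contracts the last axis of h_r with the first axis of phase
via_tensordot = np.tensordot(h_r, phase, axes=1)

# the manual version used earlier: flatten, multiply, restore the shape
via_reshape = h_r.reshape(-1, phase.shape[0]).dot(phase)
via_reshape = via_reshape.reshape(h_r.shape[:2] + (-1,))

print(via_tensordot.shape)                      # (3, 4, 7)
print(np.allclose(via_tensordot, via_reshape))  # True
```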

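I suspect the first operation is hitting a resource limit. You might benefit from these two questions: "Efficient dot products of large memory-mapped arrays" and "Dot product of huge arrays in numpy".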
