CUDA GPU is slower than the CPU on simple numpy operations



I'm using the following code, based on this article, to look for a GPU speedup, but all I can see is a slowdown:

import numpy as np
from timeit import default_timer as timer
from numba import vectorize
import sys

if len(sys.argv) != 3:
    exit("Usage: " + sys.argv[0] + " [cuda|cpu] N(100000-11500000)")

# Compile the ufunc for the target given on the command line (cpu or cuda)
@vectorize(["float32(float32, float32)"], target=sys.argv[1])
def VectorAdd(a, b):
    return a + b

def main():
    N = int(sys.argv[2])
    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)

    start = timer()
    C = VectorAdd(A, B)
    elapsed_time = timer() - start

    #print("C[:5] = " + str(C[:5]))
    #print("C[-5:] = " + str(C[-5:]))
    print("Time: {}".format(elapsed_time))

main()

Results:

$ python speed.py cpu 100000
Time: 0.0001056949986377731
$ python speed.py cuda 100000
Time: 0.11871792199963238
$ python speed.py cpu 11500000
Time: 0.013704434997634962
$ python speed.py cuda 11500000
Time: 0.47120747699955245

I cannot use larger vectors, because those raise a numba.cuda.cudadrv.driver.CudaAPIError: Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE exception.

The output of nvidia-smi is:

Fri Dec  8 10:36:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.98                 Driver Version: 384.98                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro 2000D        Off  | 00000000:01:00.0  On |                  N/A |
| 30%   36C   P12    N/A /  N/A |    184MiB /   959MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       933      G   /usr/lib/xorg/Xorg                            94MiB |
|    0       985      G   /usr/bin/gnome-shell                          86MiB |
+-----------------------------------------------------------------------------+

CPU details:

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               58
Model name:          Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
Stepping:            9
CPU MHz:             3300.135
CPU max MHz:         3700.0000
CPU min MHz:         1600.0000
BogoMIPS:            6600.27
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            6144K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts

The GPU is an Nvidia Quadro 2000D with 192 CUDA cores and 1 GB of RAM.

A more complex operation:

import numpy as np
from timeit import default_timer as timer
from numba import vectorize
import sys

if len(sys.argv) != 3:
    exit("Usage: " + sys.argv[0] + " [cuda|cpu] N()")

# Despite the name, this ufunc performs an element-wise multiplication
@vectorize(["float32(float32, float32)"], target=sys.argv[1])
def VectorAdd(a, b):
    return a * b

def main():
    N = int(sys.argv[2])
    A = np.zeros((N, N), dtype='f')
    B = np.zeros((N, N), dtype='f')
    A[:] = np.random.randn(*A.shape)
    B[:] = np.random.randn(*B.shape)

    start = timer()
    C = VectorAdd(A, B)
    elapsed_time = timer() - start

    print("Time: {}".format(elapsed_time))

main()

Results:

$ python complex.py cpu 3000
Time: 0.010573603001830634
$ python complex.py cuda 3000
Time: 0.3956961739968392
$ python complex.py cpu 30
Time: 9.693001629784703e-06
$ python complex.py cuda 30
Time: 0.10848476299725007

Any idea why?

Probably your arrays are too small and the operations too simple to offset the cost of the data transfers associated with the GPU. Another way to look at it is that the timing is not fair: for the GPU you are also timing the memory transfers, not just the processing.
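To make the comparison fairer, the transfers and the kernel can be timed separately. Here is a minimal sketch of how that could look (my own example, not from the question; it assumes the same VectorAdd ufunc compiled with target="cuda" and uses numba's cuda.to_device / copy_to_host so the kernel time is isolated from the PCIe copies):

import numpy as np
from timeit import default_timer as timer
from numba import cuda, vectorize

@vectorize(["float32(float32, float32)"], target="cuda")
def VectorAdd(a, b):
    return a + b

N = 11500000
A = np.ones(N, dtype=np.float32)
B = np.ones(N, dtype=np.float32)

VectorAdd(A[:16], B[:16])              # warm-up call so JIT compilation is not timed

t0 = timer()
d_A = cuda.to_device(A)                # host -> device copies
d_B = cuda.to_device(B)
d_C = cuda.device_array_like(A)
t1 = timer()
VectorAdd(d_A, d_B, out=d_C)           # kernel only, data already on the GPU
cuda.synchronize()                     # wait for the kernel before stopping the clock
t2 = timer()
C = d_C.copy_to_host()                 # device -> host copy
t3 = timer()

print("copy in : {:.4f} s".format(t1 - t0))
print("kernel  : {:.4f} s".format(t2 - t1))
print("copy out: {:.4f} s".format(t3 - t2))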

Try some more challenging examples, maybe first an element-wise multiplication of big matrices, and then a matrix multiplication.
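For instance, here is a rough sketch of a matrix multiplication written with numba.cuda.jit (my own illustration, not from the question or the linked article; the kernel is deliberately naive and untuned, it only shows a workload that does far more arithmetic per byte transferred than a single add):

import numpy as np
from numba import cuda

@cuda.jit
def matmul(A, B, C):
    # each thread computes one element of the result matrix
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        acc = 0.0
        for k in range(A.shape[1]):
            acc += A[i, k] * B[k, j]
        C[i, j] = acc

n = 1024
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)
C = np.zeros((n, n), dtype=np.float32)

threads = (16, 16)
blocks = ((n + threads[0] - 1) // threads[0],
          (n + threads[1] - 1) // threads[1])
matmul[blocks, threads](A, B, C)       # numba transfers the arrays and copies C back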

Finally, the strength of the GPU is performing many operations on the same data, so that you end up paying the data-transfer cost only once.
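A rough sketch of that idea (again my own example): copy the inputs to the device once, run many kernel launches on the device-resident arrays, and copy the result back once at the end:

import numpy as np
from numba import cuda, vectorize

@vectorize(["float32(float32, float32)"], target="cuda")
def VectorAdd(a, b):
    return a + b

N = 11500000
d_A = cuda.to_device(np.ones(N, dtype=np.float32))   # one host -> device copy
d_B = cuda.to_device(np.ones(N, dtype=np.float32))
d_C = cuda.device_array_like(d_A)

for _ in range(100):                   # 100 kernel launches, no extra copies
    VectorAdd(d_A, d_B, out=d_C)
cuda.synchronize()

C = d_C.copy_to_host()                 # one device -> host copy at the very end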

Despite the example on NVIDIA's website, which is meant to show "how to use the GPU", a plain matrix addition will probably be slower on the GPU than on the CPU, mainly because of the overhead of copying the data over to the GPU.

Even simple math calculations might be slower. Heavier computations already show a gain. I put my results together in an article showing the speed improvements with GPU, CUDA and numpy.

In short, the question is which is bigger:

CPU time

or

copy to GPU + GPU time + copy from GPU
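A back-of-envelope sketch of that comparison for the 11.5-million-element case (the PCIe throughput below is an assumption for illustration, not a measurement on the poster's Quadro 2000D):

N = 11500000
bytes_per_array = N * 4                # float32
total_copy = 3 * bytes_per_array       # A and B to the GPU, C back
pcie_bandwidth = 3e9                   # assumed ~3 GB/s effective throughput
print("copy volume: {:.0f} MB".format(total_copy / 1e6))
print("copy time  : ~{:.0f} ms".format(total_copy / pcie_bandwidth * 1e3))
# ~138 MB and roughly 46 ms of pure transfer, versus ~14 ms for the whole
# CPU run above, so a single element-wise add can never win on the GPU here.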
