Numba and guvectorize with a CUDA target: code runs slower than expected



Important details

  • Large datasets (10 million x 5), (200 x 10 million x 5)
  • Mostly Numba
  • Takes longer after every run
  • Using Spyder3
  • Windows 10

The first thing I tried was to use guvectorize with the function below. I pass in a bunch of numpy arrays and attempt to use them to multiply across two of the arrays. This works if run with any target other than cuda. However, when switching to cuda it results in an unknown error:

文件"C:\ProgramData\Anaconda3\lib\site packages\numba\cuda\decorators.py",>第82行,在jitwrapper中debug=debug)

TypeError: init() got an unexpected keyword argument 'debug'

After chasing down everything I could find about this error, I hit nothing but dead ends. I'm guessing it's a really simple fix that I'm completely missing, but oh well. It should also be noted that this error only shows up after running the code once and having it crash due to memory overload.

os.environ["NUMBA_ENABLE_CUDASIM"] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = "10DE 1B06 63933842"
...

All of the arrays are numpy arrays:

@guvectorize(['void(int64, float64[:,:], float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'],
             '(),(m,o),(m,o),(n,m,o),(),() -> (n,m,o)',
             target='cuda', nopython=True)
def cVestDiscount (ed, orCV, vals, discount, n, rowCount, cv):
    for as_of_date in range(0,ed):
        for ID in range(0,rowCount):
            for num in range(0,n):
                cv[as_of_date][ID][num] = orCV[ID][num] * discount[as_of_date][ID][num]

Trying to run the code with the nvprof profiler at the command line results in the following error:

Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, the system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory

I realized that I'm running graphics cards with SLI enabled (both cards are identical, evga gtx 1080ti, and have the same device id), so I disabled SLI and added the "CUDA_VISIBLE_DEVICES" line to try to limit things to the other card, but the result was the same.

I can still run the code with nvprof, but the cuda function is slow compared to njit(parallel=True) and prange. By using a smaller data size we can run the code, but it is slower than target='parallel' and target='cpu'.
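
(For context, a simplified sketch of the kind of njit(parallel=True)/prange version being compared against; illustrative only, not the exact code used:)

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def cVestDiscount_cpu(multBy, discount, cv):
    # parallelize over the outer (as_of_date) dimension on the CPU
    for as_of_date in prange(discount.shape[0]):
        for ID in range(discount.shape[1]):
            for num in range(discount.shape[2]):
                cv[as_of_date, ID, num] = multBy[ID, num] * discount[as_of_date, ID, num]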

Why is cuda so much slower, and what do these errors mean?

Thanks for any help!

Edit: Here is a working example of the code:

import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer

@guvectorize(['void(int64, float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'], '(),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (countRow, multBy, discount, n, countCol, cv):
    for as_of_date in range(0,countRow):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[as_of_date][ID][num] = multBy[ID][num] * discount[as_of_date][ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))

I am able to run the code in cuda using the gtx 1080ti; however, it is much slower than running it in parallel or on the cpu. I've looked at other posts related to guvectorize, but none of them helped me understand what is and isn't optimal to run in guvectorize. Is there any way to make this code "cuda friendly", or is just doing multiplication across arrays too simple to see any benefit?

First of all, the basic operation you have shown is to take two matrices, transfer them to the GPU, do some elementwise multiplications to produce a third array, and pass that third array back to the host.

It may be possible to make a numba/cuda guvectorize (or cuda.jit kernel) implementation that runs faster than a naive serial python implementation, but I doubt it would be possible to exceed the performance of well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing. This is because the arithmetic intensity per byte of data transferred between host and device is just too low. This operation is too simple.
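
As a rough back-of-the-envelope (using the array sizes from the working example in the question; an illustration, not a measurement), the transfers dominate the arithmetic:

# two float64 inputs and one float64 output cross the PCIe bus,
# for a single multiply per output element
bytes_moved = (4000*5 + 100*4000*5 + 100*4000*5) * 8   # multBy + discount + cv
flops = 100*4000*5                                      # one multiply per output element
print(flops / bytes_moved)                              # ~0.06 FLOP per byte transferred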

Secondly, and importantly I believe, it helps to start with an understanding of the intent of numba vectorize and guvectorize. The basic principle is to write the ufunc definition from the standpoint of "what will one worker do?", and then allow numba to spin up multiple workers from that. The way you instruct numba to spin up multiple workers is to pass a data set that is larger than the signatures you have given. It should be noted that numba does not know how to parallelize a for-loop inside a ufunc definition. It gets its parallel "strength" by taking your ufunc definition and running it among parallel workers, where each worker handles a "slice" of the data but runs your entire ufunc definition on that slice. As some additional reading, I've covered some of this ground here as well.
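
As a toy illustration of that principle (a hypothetical example, not taken from the post): the ufunc below is written for a single pair of scalar inputs, and numba gets its parallelism by mapping workers over however many elements are passed in.

import numpy as np
from numba import vectorize

@vectorize(['float64(float64, float64)'], target='parallel')
def scale(a, b):
    # written from the standpoint of one worker: one element in, one element out
    return a * b

x = np.arange(1000000, dtype=np.float64)
y = np.full_like(x, 2.0)
z = scale(x, y)   # numba splits the 1,000,000 elements across parallel workers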

So a problem we see in your realization is that you have written a signature (and ufunc) which maps the entire input data set onto a single worker. As @talonmies has shown, your underlying kernel is being spun up with a total of 64 threads/workers (which is far too small to be interesting on a GPU, even apart from the above statement about arithmetic intensity), but I suspect that in fact 64 is just a minimum threadblock size for numba, and that only 1 thread in that threadblock is doing any useful computational work. That one thread is executing your entire ufunc, including all the for-loops, in a serial fashion.

That is obviously not what anyone would intend by rational use of vectorize or guvectorize.

So let's revisit what you are trying to do. Ultimately, your ufunc wants to multiply an input value from one array by an input value from another array and store the result in a third array. We want to repeat that process many times. If all 3 array sizes were the same, we could actually realize this with vectorize and would not even need to resort to the more complicated guvectorize. Let's compare that approach to your original one, focusing on the CUDA kernel execution. Here's a worked example, where t14.py is your original code run with the profiler, and t15.py is a vectorize version of it, noting that we have changed the size of the multBy array so that it matches cv and discount:

$ nvprof --print-gpu-trace python t14.py
==4145== NVPROF is profiling process 4145, command: python t14.py
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python t14.py
==4145== Profiling result:
Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
312.36ms  1.2160us                    -               -         -         -         -        8B  6.2742MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.81ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.52ms  5.8696ms                    -               -         -         -         -  15.259MB  2.5387GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.74ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.93ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
321.40ms  1.22538s              (1 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s  7.1816ms                    -               -         -         -         -  15.259MB  2.0749GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer

@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t15.py
==4167== NVPROF is profiling process 4167, command: python t15.py
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python t15.py
==4167== Profiling result:
Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
193.92ms  6.2729ms                    -               -         -         -         -  15.259MB  2.3755GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
201.09ms  5.7101ms                    -               -         -         -         -  15.259MB  2.6096GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
364.92ms  842.49us          (15625 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__vectorized_cVestDiscount$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms  7.1528ms                    -               -         -         -         -  15.259MB  2.0833GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

We see that your application reported a run time of about 1.244 seconds, whereas the vectorize version reports about 0.375 seconds. But there is python overhead in both of those numbers. If we look at the durations of the generated CUDA kernels in the profiler, the difference is even more stark. We see that the original kernel took about 1.225 seconds, whereas the vectorize kernel executes in about 842 microseconds (i.e. less than 1 millisecond). We also note that the compute kernel time is now much smaller than the time it takes to transfer the 3 arrays to/from the GPU (about 20 milliseconds in total), and we note that the kernel dimensions are now 15625 blocks of 128 threads each, for a total thread/worker count of 2,000,000, exactly matching the total number of multiply operations to be done, and substantially more than the paltry 64 threads (and possibly, really only 1 thread) doing any work in your original code.

Given the simplicity of the above vectorize approach, if what you really want to do is this element-wise multiplication, then you might consider just replicating multBy so that it matches the other two arrays dimensionally, and be done with it.
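
One hedged way to do that replication without hand-copying the data is numpy broadcasting, reusing the vectorize ufunc from t15.py (multBy3d is an illustrative name, not from the original code):

import numpy as np

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
# expand multBy to the full (100,4000,5) shape expected by the element-wise ufunc
multBy3d = np.ascontiguousarray(np.broadcast_to(multBy, discount.shape))
cv = cVestDiscount(multBy3d, discount)   # cVestDiscount as defined in t15.py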

But the question remains: how do we handle differing input array sizes, as in the original problem? For that, I think we need to go to guvectorize (or, as @talonmies indicated, write your own @cuda.jit kernel, which is probably the best advice, notwithstanding the possibility that none of these approaches may overcome the overhead of transferring data to/from the device, as already mentioned).

In order to tackle this with guvectorize, we need to think more carefully about the "slicing" concept already mentioned. Let's re-write your guvectorize kernel so that it only operates on a "slice" of the overall data, and then allow the guvectorize launch function to spin up multiple workers to tackle it, one worker per slice.

In CUDA, we like to have lots of workers; you really can't have too many. So this will affect how we "slice" our arrays, so as to give multiple workers the possibility to act. If we were to slice along the 3rd (last, n) dimension, we would only have 5 slices to work with, so a maximum of 5 workers. Likewise, if we slice along the first, or countRow dimension, we will have 100 slices, so a maximum of 100 workers. Ideally, we would slice along the 2nd, or countCol dimension. However, for simplicity, I will slice along the first, or countRow dimension. This is clearly non-optimal, but see the worked example below for how you might handle the slicing-by-second-dimension problem. Slicing by the first dimension means we will remove the first for-loop from the guvectorize kernel and allow the ufunc system to parallelize along that dimension (based on the sizes of the arrays we pass). The code could look something like this:

$ cat t16.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer

@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (multBy, discount, n, countCol, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t16.py
==4275== NVPROF is profiling process 4275, command: python t16.py
Function: discount factor cumVest duration (seconds):0.0670170783997
==4275== Profiling application: python t16.py
==4275== Profiling result:
Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
307.05ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
307.79ms  5.9293ms                    -               -         -         -         -  15.259MB  2.5131GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.34ms  1.3440us                    -               -         -         -         -        8B  5.6766MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.54ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
317.27ms  47.398ms              (2 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<double, int=3, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
364.67ms  7.3799ms                    -               -         -         -         -  15.259MB  2.0192GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

Observations:

  1. The code changes were related to removing the countRow parameter, removing the first for-loop from the guvectorize kernel, and making the appropriate changes to the function signature to reflect this. We also modified the 3-dimensional quantities in the signature to 2-dimensional ones. After all, we are taking 2-dimensional "slices" of the 3-dimensional data and letting each worker work on one slice.

  2. The kernel dimensions reported by the profiler are now 2 blocks instead of 1. This makes sense, because in the original realization there was really only 1 "slice" presented, so 1 worker was needed, and therefore 1 thread (but numba spun up a single threadblock of 64 threads). In this realization there are 100 slices, and numba chose to spin up 2 threadblocks of 64 workers/threads each, to provide the needed 100 workers/threads.

  3. The kernel performance of 47.4ms reported by the profiler is now somewhere in between the original (~1.224s) and the massively parallel vectorize version (~0.001s). So going from 1 worker to 100 workers has sped things up considerably, but more performance gains are possible. If you figure out how to slice on the countCol dimension, you can probably get closer to the vectorize version performance-wise (see below). Note that the difference between where we are now (~47ms) and the vectorize version (~1ms) is more than enough to make up for the additional transfer cost (~5ms or less) of moving a slightly larger multBy matrix to the device, in order to facilitate the vectorize simplicity.

A few additional comments on the python timing: I believe the exact behavior of how python compiles the necessary kernels for the original, vectorize, and improved guvectorize versions is different. If we modify the t15.py code to do a "warm-up" run, then at least the python timing is consistent, trend-wise, with the overall wall time and the kernel-only time:

$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer

@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python t14.py
Function: discount factor cumVest duration (seconds):1.24376320839
real    0m2.522s
user    0m1.572s
sys     0m0.809s
$ time python t15.py
Function: discount factor cumVest duration (seconds):0.0228319168091
real    0m1.050s
user    0m0.473s
sys     0m0.445s
$ time python t16.py
Function: discount factor cumVest duration (seconds):0.0665760040283
real    0m1.252s
user    0m0.680s
sys     0m0.441s
$

Now, to effectively answer a question from the comments: "How would I rework the problem to slice along the 4000 (countCol, or "middle") dimension?"

We can be guided by what was needed to slice along the first dimension. One possible approach is to rearrange the shape of the arrays so that the 4000 dimension is the first dimension, then remove it, similar to what we did in the previous treatment of guvectorize. Here's a worked example:

$ cat t17.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer

@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (countCol, multBy, discount, n, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[num] * discount[ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python t17.py
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python t17.py
==8544== NVPROF is profiling process 8544, command: python t17.py
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python t17.py
==8544== Profiling result:
Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
304.92ms  1.1840us                    -               -         -         -         -        8B  6.4437MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
305.36ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
306.08ms  6.0208ms                    -               -         -         -         -  15.259MB  2.4749GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.44ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.59ms  8.9961ms             (63 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms  7.2772ms                    -               -         -         -         -  15.259MB  2.0476GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

Predictably, we observe that the execution time has dropped from ~47ms when we sliced into 100 workers to ~9ms when we slice into 4000 workers. Similarly, we observe that numba chooses to spin up 63 blocks of 64 threads each, for a total of 4032 threads, to handle the 4000 workers needed for this "slicing".

Still not as fast as the ~1ms vectorize kernel (which has many more available parallel "slices" for workers), but quite a bit faster than the ~1.2s kernel proposed in the original question. And the overall walltime of the python code is about 2x faster, even with all the python overhead.

As a final observation, let's revisit my earlier statement (similar to statements made in the comments and in the other answer):

"我怀疑是否有可能超过编写良好的主机代码的性能(例如,使用一些并行化方法,如guvectorize)来做同样的事情。">

We now have convenient test cases in either t16.py or t17.py that we could use to test this. For simplicity, I'll choose t16.py. We can "convert it back to CPU code" simply by removing the target designation from the guvectorize ufunc:

$ cat t16a.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer

@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)')
def cVestDiscount (multBy, discount, n, countCol, cv):
    for ID in range(0,countCol):
        for num in range(0,n):
            cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ time python t16a.py
Function: discount factor cumVest duration (seconds):0.00657796859741
real    0m0.528s
user    0m0.474s
sys     0m0.047s
$

So we see that this CPU-only version runs the function in about 6 milliseconds, and it has none of the GPU "overhead" such as CUDA initialization and copying data to/from the GPU. The overall walltime is also our best measurement, at about 0.5s, compared to about 1.0s for our best GPU case. So this particular problem, due to its low arithmetic intensity per byte of data transfer, probably isn't well suited to GPU computation.

The reason why the gufunc Numba emits and runs is so slow becomes immediately obvious when profiling (numba 0.38.1 with CUDA 8.0):

==24691== Profiling application: python slowvec.py
==24691== Profiling result:
Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
271.33ms  1.2800us                    -               -         -         -         -        8B  5.9605MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
271.65ms  14.591us                    -               -         -         -         -  156.25KB  10.213GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
272.09ms  2.5868ms                    -               -         -         -         -  15.259MB  5.7605GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
274.98ms     992ns                    -               -         -         -         -        8B  7.6909MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
275.17ms     640ns                    -               -         -         -         -        8B  11.921MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
276.33ms  657.28ms              (1 1 1)        (64 1 1)        40        0B        0B         -           -  GeForce GTX 970         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [38]
933.62ms  3.5128ms                    -               -         -         -         -  15.259MB  4.2419GB/s  GeForce GTX 970         1         7  [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

The kernel launch running the resulting code uses a single block of 64 threads. On a GPU which can theoretically hold up to 2048 threads per MP and has 23 MPs, that means about 99.9% of your GPU's theoretical processing capacity is not being used. This looks like a ridiculous design choice by the numba developers, and I would report it as a bug if you are hindered by it (and it seems you are).
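
Spelling out the occupancy arithmetic behind that figure (using the numbers quoted above):

threads_launched = 1 * 64                        # one block of 64 threads
theoretical_capacity = 2048 * 23                 # 2048 threads per MP x 23 MPs = 47104
print(threads_launched / theoretical_capacity)   # ~0.0014, i.e. roughly 99.9% of capacity idle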

The obvious solution is to rewrite your function as a @cuda.jit function in the CUDA python kernel dialect and take explicit control of the execution parameters. That way you can at least ensure that the code is run with enough threads to potentially use all the capacity of your hardware. It is still a very memory-bound operation, so the speed-up you can achieve will probably be restricted to something considerably less than the ratio of your GPU's memory bandwidth to your CPU's. And that probably isn't enough to amortize the cost of host-to-device memory transfers, so there may be no performance gain even in the best possible case, although this is far from it.
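
A minimal sketch of what such a @cuda.jit rewrite could look like (an illustration with assumed names and launch parameters, not code from the post): one thread per (as_of_date, ID) pair, with the short length-n inner loop kept serial inside each thread.

import numpy as np
from numba import cuda

@cuda.jit
def cVestDiscount_kernel(multBy, discount, cv):
    as_of_date, ID = cuda.grid(2)    # 2D grid over the first two dimensions
    if as_of_date < discount.shape[0] and ID < discount.shape[1]:
        for num in range(discount.shape[2]):
            cv[as_of_date, ID, num] = multBy[ID, num] * discount[as_of_date, ID, num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
cv = np.zeros_like(discount)

threads = (8, 32)                                              # 256 threads per block
blocks = ((discount.shape[0] + threads[0] - 1) // threads[0],
          (discount.shape[1] + threads[1] - 1) // threads[1])
cVestDiscount_kernel[blocks, threads](multBy, discount, cv)    # explicit launch configuration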

The short version is: beware the perils of automagic compiler-generated parallelism....

As a postscript to add, I managed to work out how to get the PTX of the code numba emits, and suffice to say, it's absolutely terrible (and so long that I can't actually post all of it):

{
.reg .pred  %p<9>;
.reg .b32   %r<8>;
.reg .f64   %fd<4>;
.reg .b64   %rd<137>;

ld.param.u64    %rd29, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_5];
ld.param.u64    %rd31, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_11];
ld.param.u64    %rd32, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
ld.param.u64    %rd34, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_14];
ld.param.u64    %rd35, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_15];
ld.param.u64    %rd36, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_16];
ld.param.u64    %rd37, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_17];
ld.param.u64    %rd38, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_22];
ld.param.u64    %rd39, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_23];
ld.param.u64    %rd40, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_24];
ld.param.u64    %rd41, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_25];
ld.param.u64    %rd42, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_26];
ld.param.u64    %rd43, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_27];
ld.param.u64    %rd44, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_28];
ld.param.u64    %rd45, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_29];
ld.param.u64    %rd46, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_30];
ld.param.u64    %rd48, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_36];
ld.param.u64    %rd51, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_43];
ld.param.u64    %rd53, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_49];
ld.param.u64    %rd54, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_50];
ld.param.u64    %rd55, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_51];
ld.param.u64    %rd56, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_52];
ld.param.u64    %rd57, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_53];
ld.param.u64    %rd58, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_54];
ld.param.u64    %rd59, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_55];
ld.param.u64    %rd60, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_56];
ld.param.u64    %rd61, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_57];
mov.u32     %r1, %tid.x;
mov.u32     %r3, %ctaid.x;
mov.u32     %r2, %ntid.x;
mad.lo.s32  %r4, %r3, %r2, %r1;
min.s64     %rd62, %rd32, %rd29;
min.s64     %rd63, %rd39, %rd62;
min.s64     %rd64, %rd48, %rd63;
min.s64     %rd65, %rd51, %rd64;
min.s64     %rd66, %rd54, %rd65;
cvt.s64.s32 %rd1, %r4;
setp.le.s64 %p2, %rd66, %rd1;
@%p2 bra    BB0_8;
ld.param.u64    %rd126, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_42];
ld.param.u64    %rd125, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_44];
ld.param.u64    %rd124, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_35];
ld.param.u64    %rd123, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_37];
ld.param.u64    %rd122, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_4];
ld.param.u64    %rd121, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_6];
cvt.u32.u64 %r5, %rd1;
setp.lt.s32 %p1, %r5, 0;
selp.b64    %rd67, %rd29, 0, %p1;
add.s64     %rd68, %rd67, %rd1;
mul.lo.s64  %rd69, %rd68, %rd121;
add.s64     %rd70, %rd69, %rd122;
selp.b64    %rd71, %rd48, 0, %p1;
add.s64     %rd72, %rd71, %rd1;
mul.lo.s64  %rd73, %rd72, %rd123;
add.s64     %rd74, %rd73, %rd124;
ld.u64  %rd2, [%rd74];
selp.b64    %rd75, %rd51, 0, %p1;
add.s64     %rd76, %rd75, %rd1;
mul.lo.s64  %rd77, %rd76, %rd125;
add.s64     %rd78, %rd77, %rd126;
ld.u64  %rd3, [%rd78];
ld.u64  %rd4, [%rd70];
setp.lt.s64 %p3, %rd4, 1;
@%p3 bra    BB0_8;
ld.param.u64    %rd128, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_13];
ld.param.u64    %rd127, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
selp.b64    %rd80, %rd127, 0, %p1;
mov.u64     %rd79, 0;
min.s64     %rd81, %rd128, %rd79;
min.s64     %rd82, %rd34, %rd79;
selp.b64    %rd83, %rd39, 0, %p1;
min.s64     %rd84, %rd40, %rd79;
min.s64     %rd85, %rd41, %rd79;
min.s64     %rd86, %rd42, %rd79;
selp.b64    %rd87, %rd54, 0, %p1;
min.s64     %rd88, %rd55, %rd79;
min.s64     %rd89, %rd56, %rd79;
min.s64     %rd90, %rd57, %rd79;
mul.lo.s64  %rd91, %rd90, %rd61;
add.s64     %rd92, %rd53, %rd91;
mul.lo.s64  %rd93, %rd89, %rd60;
add.s64     %rd94, %rd92, %rd93;
mul.lo.s64  %rd95, %rd88, %rd59;
add.s64     %rd96, %rd94, %rd95;
add.s64     %rd98, %rd87, %rd1;
mul.lo.s64  %rd99, %rd58, %rd98;
add.s64     %rd5, %rd96, %rd99;
mul.lo.s64  %rd100, %rd86, %rd46;
add.s64     %rd101, %rd38, %rd100;
mul.lo.s64  %rd102, %rd85, %rd45;
add.s64     %rd103, %rd101, %rd102;
mul.lo.s64  %rd104, %rd84, %rd44;
add.s64     %rd105, %rd103, %rd104;
add.s64     %rd106, %rd83, %rd1;
mul.lo.s64  %rd107, %rd43, %rd106;
add.s64     %rd6, %rd105, %rd107;
mul.lo.s64  %rd108, %rd82, %rd37;
add.s64     %rd109, %rd31, %rd108;
mul.lo.s64  %rd110, %rd81, %rd36;
add.s64     %rd111, %rd109, %rd110;
add.s64     %rd112, %rd80, %rd1;
mul.lo.s64  %rd113, %rd35, %rd112;
add.s64     %rd7, %rd111, %rd113;
add.s64     %rd8, %rd2, 1;
mov.u64     %rd131, %rd79;
BB0_3:
mul.lo.s64  %rd115, %rd59, %rd131;
add.s64     %rd10, %rd5, %rd115;
mul.lo.s64  %rd116, %rd44, %rd131;
add.s64     %rd11, %rd6, %rd116;
setp.lt.s64 %p4, %rd3, 1;
mov.u64     %rd130, %rd79;
mov.u64     %rd132, %rd3;
@%p4 bra    BB0_7;
BB0_4:
mov.u64     %rd13, %rd132;
mov.u64     %rd12, %rd130;
mul.lo.s64  %rd117, %rd60, %rd12;
add.s64     %rd136, %rd10, %rd117;
mul.lo.s64  %rd118, %rd45, %rd12;
add.s64     %rd135, %rd11, %rd118;
mul.lo.s64  %rd119, %rd36, %rd12;
add.s64     %rd134, %rd7, %rd119;
setp.lt.s64 %p5, %rd2, 1;
mov.u64     %rd133, %rd8;
@%p5 bra    BB0_6;
BB0_5:
mov.u64     %rd17, %rd133;
ld.f64  %fd1, [%rd135];
ld.f64  %fd2, [%rd134];
mul.f64     %fd3, %fd2, %fd1;
st.f64  [%rd136], %fd3;
add.s64     %rd136, %rd136, %rd61;
add.s64     %rd135, %rd135, %rd46;
add.s64     %rd134, %rd134, %rd37;
add.s64     %rd24, %rd17, -1;
setp.gt.s64 %p6, %rd24, 1;
mov.u64     %rd133, %rd24;
@%p6 bra    BB0_5;
BB0_6:
add.s64     %rd25, %rd13, -1;
add.s64     %rd26, %rd12, 1;
setp.gt.s64 %p7, %rd13, 1;
mov.u64     %rd130, %rd26;
mov.u64     %rd132, %rd25;
@%p7 bra    BB0_4;
BB0_7:
sub.s64     %rd120, %rd4, %rd131;
add.s64     %rd131, %rd131, 1;
setp.gt.s64 %p8, %rd120, 1;
@%p8 bra    BB0_3;
BB0_8:
ret;
}

All of those integer operations to perform exactly one double precision multiply!
