Optimizing my CUDA kernel for summing over different index ranges within a torch tensor



I want to write a CUDA kernel that sums given (contiguous) index ranges of an array. For example, with the input array arr = [1]*10, I want 3 sums - sum(arr[0:2]), sum(arr[2:3]) and sum(arr[3:10]) - so the output should be [2, 1, 7].

My arrays are large 2-D arrays (so I want to sum every row over the same index ranges), with dimensions typically around 1000 x 100000, and the index sub-ranges to be summed vary widely in size (from 1 to >1000). The arrays already live on the GPU as PyTorch tensors, so moving them back and forth to the CPU for this purpose is expensive.
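
For reference, here is a plain numpy sketch of the result I'm after (the same segment boundaries applied to every row), mirroring the small example above:

import numpy as np

arr = np.ones((3, 10))
idxs = [0, 2, 3, 10]   # segment boundaries: [0:2], [2:3], [3:10]
# sum each row over each contiguous segment
expected = np.stack([arr[:, lo:hi].sum(axis=1)
                     for lo, hi in zip(idxs[:-1], idxs[1:])], axis=1)
print(expected)   # [[2. 1. 7.] [2. 1. 7.] [2. 1. 7.]]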

I wrote the Numba kernel below (a minimal working example). Essentially, each thread is responsible for one source column: it looks up the target column it belongs to (w.r.t. the index ranges) and adds that column to the target.

from math import ceil
from numba import cuda
import numpy as np

@cuda.jit
def sum_idxs(arr, idxs, sum_arr):
    # one thread per source column
    pos = cuda.grid(1)
    if pos >= arr.shape[1]: return
    # find which target segment this column belongs to
    for i in range(len(idxs)):
        if idxs[i] <= pos < idxs[i+1]:
            thread_idx = i
            break
    # add this column's value, for every row, into the target segment
    for i in range(arr.shape[0]):
        cuda.atomic.add(sum_arr, (i, thread_idx), arr[i, pos])

arr = np.ones(shape=(3, 10))
idxs = np.array([0, 2, 3, 10])
sum_arr = np.zeros(shape=(arr.shape[0], len(idxs)-1))
threads_per_block = 32
blocks_per_grid = ceil(arr.shape[1] / threads_per_block)
sum_idxs[threads_per_block, blocks_per_grid](arr, idxs, sum_arr)
print(sum_arr)

It gives the correct results

[[2. 1. 7.]
[2. 1. 7.]
[2. 1. 7.]]

and it lets me keep the tensors on the GPU, as desired.

(For simplicity I've used numpy arrays here. In my actual code I pass the pytorch tensors with cuda.as_cuda_array(tensor).)
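
For concreteness, here is a minimal sketch of that launch path (tensor names are placeholders; it assumes float32/int32 CUDA tensors and the sum_idxs kernel defined above):

import math
import torch
from numba import cuda

# hypothetical tensors, already resident on the GPU
arr_t = torch.ones((3, 10), dtype=torch.float32, device="cuda")
idxs_t = torch.tensor([0, 2, 3, 10], dtype=torch.int32, device="cuda")
sum_t = torch.zeros((arr_t.shape[0], idxs_t.shape[0] - 1),
                    dtype=torch.float32, device="cuda")

threads_per_block = 32
blocks_per_grid = math.ceil(arr_t.shape[1] / threads_per_block)
# wrap the tensors (zero-copy) and launch; config order is [blocks, threads]
sum_idxs[blocks_per_grid, threads_per_block](cuda.as_cuda_array(arr_t),
                                             cuda.as_cuda_array(idxs_t),
                                             cuda.as_cuda_array(sum_t))
print(sum_t)   # expected: [[2., 1., 7.], ...] on the GPU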

However, this is still a major performance bottleneck in my code. Is there any way to optimize it further?

Here is one possible approach. A segmented reduction can usually be implemented fairly efficiently with one block per segment (or, in this case, one block per row). If the number of segments/rows is large enough, this will tend to saturate the GPU.

The design I'm suggesting uses one block per row, and each block processes the 3 segments of that row in sequence. To handle a segment, the block does the initial data gathering with a block-stride loop, feeding a canonical shared-memory CUDA reduction.

Here is an example; it also fixes a couple of things in your code that were mentioned in the comments (correct grid dimensions, conversion to float32):

$ cat t73.py
from numba import cuda,float32,int32
import numpy as np
import math
#TPB = threads per block, max of 1024
#TPB must be the power-of-2 expressed in TPBP2, i.e. TPB = 2**TPBP2
TPB   = 1024
H     = TPB//2
TPBP2 = 10
@cuda.jit
def sum_idxs(arr, idxs, sum_arr):
    pos = cuda.grid(1)
    if pos>=arr.shape[1]: return
    for i in range(len(idxs)):
        if idxs[i]<=pos<idxs[i+1]:
            thread_idx = i
            break
    for i in range(arr.shape[0]):
        cuda.atomic.add(sum_arr, (i, thread_idx), arr[i, pos])
@cuda.jit
def sum_idxs_i(arr, idxs, sum_arr):
    s = cuda.shared.array(shape=(TPB), dtype=float32)
    tx = cuda.threadIdx.x
    row = cuda.blockIdx.x
    #process each of the 3 segments in a row
    for j in range(3):
        lower = idxs[j]
        upper = idxs[j+1]
        val = float32(0)
        #block-stride loop to gather data from the segment
        for i in range(tx+lower, upper, TPB):
            val += arr[row, i]
        #canonical shared-memory parallel reduction
        s[tx] = val
        mid = H
        for i in range(TPBP2):
            cuda.syncthreads()
            if tx < mid:
                s[tx] += s[tx+mid]
            mid >>= 1
        if tx == 0:
            sum_arr[row, j] = s[0]
rows = 1000
cols = 100000
arr = np.ones(shape=(rows, cols),dtype=np.float32)
idxs = np.array([0, 2, 3, cols],dtype=np.int32)
sum_arr = np.zeros(shape=(arr.shape[0], len(idxs)-1),dtype=np.float32)
blocks_per_grid = math.ceil(arr.shape[1] / TPB)
sum_idxs[blocks_per_grid, TPB](arr, idxs, sum_arr)
print(sum_arr)
sum_arr = np.zeros(shape=(arr.shape[0], len(idxs)-1),dtype=np.float32)
blocks_per_grid = (arr.shape[0])
sum_idxs_i[blocks_per_grid, TPB](arr, idxs, sum_arr)
print(sum_arr)
$ nvprof python t73.py
==4383== NVPROF is profiling process 4383, command: python t73.py
[[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
...
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]]
[[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
...
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]]
==4383== Profiling application: python t73.py
==4383== Profiling result:
Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:   45.92%  287.93ms         6  47.988ms  1.1520us  144.09ms  [CUDA memcpy HtoD]
44.88%  281.42ms         6  46.903ms  1.4720us  140.74ms  [CUDA memcpy DtoH]
8.46%  53.052ms         1  53.052ms  53.052ms  53.052ms  cudapy::__main__::sum_idxs$241(Array<float, int=2, C, mutable, aligned>, Array<int, int=1, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>)
0.75%  4.6729ms         1  4.6729ms  4.6729ms  4.6729ms  cudapy::__main__::sum_idxs_i$242(Array<float, int=2, C, mutable, aligned>, Array<int, int=1, C, mutable, aligned>, Array<double, int=2, C, mutable, aligned>)
API calls:   43.26%  339.61ms         6  56.602ms  20.831us  193.89ms  cuMemcpyDtoH
36.75%  288.52ms         6  48.087ms  15.434us  144.35ms  cuMemcpyHtoD
18.66%  146.51ms         1  146.51ms  146.51ms  146.51ms  cuDevicePrimaryCtxRetain
0.93%  7.3083ms         5  1.4617ms  4.8120us  6.7314ms  cuMemFree
0.23%  1.8049ms         6  300.81us  9.4520us  778.85us  cuMemAlloc
0.04%  327.52us         2  163.76us  156.34us  171.19us  cuLinkAddData
0.04%  299.72us         2  149.86us  148.92us  150.80us  cuModuleLoadDataEx
0.04%  276.32us         2  138.16us  131.16us  145.16us  cuLinkComplete
0.02%  123.96us         2  61.978us  61.252us  62.704us  cuLinkCreate
0.01%  64.406us         2  32.203us  29.439us  34.967us  cuLaunchKernel
0.01%  63.184us         2  31.592us  30.251us  32.933us  cuDeviceGetName
0.00%  29.454us         1  29.454us  29.454us  29.454us  cuMemGetInfo
0.00%  20.732us        26     797ns     477ns  2.0320us  cuCtxGetCurrent
0.00%  12.852us        25     514ns     363ns  1.0920us  cuCtxGetDevice
0.00%  12.429us         2  6.2140us  1.7830us  10.646us  cuDeviceGetPCIBusId
0.00%  5.0950us        10     509ns     302ns  1.0770us  cuFuncGetAttribute
0.00%  3.9600us         2  1.9800us  1.8000us  2.1600us  cuModuleGetFunction
0.00%  3.5630us         2  1.7810us  1.7510us  1.8120us  cuLinkDestroy
0.00%  1.8970us         1  1.8970us  1.8970us  1.8970us  cuCtxPushCurrent
0.00%  1.8370us         4     459ns     226ns     697ns  cuDeviceGet
0.00%  1.6080us         6     268ns     181ns     481ns  cuDeviceGetAttribute
0.00%  1.5060us         3     502ns     230ns     795ns  cuDeviceGetCount
0.00%  1.2390us         2     619ns     428ns     811ns  cuDeviceComputeCapability
$

This code was run on a GTX960, which happens to report ~84GB/s of device memory bandwidth via the bandwidthTest CUDA sample code. In the example above, the improved kernel runs in ~4.7ms (roughly 10x faster than the original atomic kernel), which works out to (1000*100000*4) bytes / 4.7ms ~= 85GB/s, so we can conclude that, for this particular test case, this kernel is approximately "optimal".
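
As a quick sanity check on that arithmetic, a minimal sketch using only the figures quoted above:

# effective bandwidth estimate from the numbers above
rows, cols = 1000, 100000
bytes_read = rows * cols * 4              # float32 input, read once
kernel_time_s = 4.7e-3                    # ~4.7 ms for sum_idxs_i from the nvprof output
print(bytes_read / kernel_time_s / 1e9)   # ~85 GB/s, close to the ~84 GB/s bandwidthTest figure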
