I want to write a CUDA kernel that sums given (contiguous) index ranges of an array. For example, with the input array arr=[1]*10, I want 3 sums - sum(arr[0:2]), sum(arr[2:3]), sum(arr[3:10]) - so the output should be [2, 1, 7].
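For reference, this is exactly the segmented sum that NumPy's np.add.reduceat computes on the CPU; a minimal sketch of the desired semantics (not the GPU implementation I am after):

import numpy as np

# CPU reference for the desired output: sums over [0:2], [2:3], [3:10]
arr = np.ones(10)
idxs = np.array([0, 2, 3, 10])
print(np.add.reduceat(arr, idxs[:-1]))   # -> [2. 1. 7.]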
My arrays are large 2D arrays (and I want to sum each row over the same index ranges), with dimensions typically around 1000 by 100000, and the index sub-ranges to sum vary widely in size (anywhere from 1 to >1000). The arrays already live on the GPU as PyTorch tensors, so moving them back and forth to the CPU for this purpose is expensive.
I wrote the Numba kernel below (a minimal working example is included here). Basically, each thread is responsible for one source column: it looks up the target column it belongs to (with respect to the index ranges) and atomically adds that column into the target.
from numba import cuda
import numpy as np
from math import ceil

@cuda.jit
def sum_idxs(arr, idxs, sum_arr):
    pos = cuda.grid(1)
    if pos >= arr.shape[1]: return
    # find which segment (target column) this source column belongs to
    for i in range(len(idxs)):
        if idxs[i] <= pos < idxs[i+1]:
            thread_idx = i
            break
    # add this column's contribution to its segment sum, for every row
    for i in range(arr.shape[0]):
        cuda.atomic.add(sum_arr, (i, thread_idx), arr[i, pos])

arr = np.ones(shape=(3, 10))
idxs = np.array([0, 2, 3, 10])
sum_arr = np.zeros(shape=(arr.shape[0], len(idxs)-1))

threads_per_block = 32
blocks_per_grid = ceil(arr.shape[1] / threads_per_block)
sum_idxs[threads_per_block, blocks_per_grid](arr, idxs, sum_arr)

print(sum_arr)
It gives the correct result:
[[2. 1. 7.]
[2. 1. 7.]
[2. 1. 7.]]
and lets me keep the tensors on the GPU as I need.
(For simplicity I used NumPy arrays here. In my real code I use cuda.as_cuda_array(tensor) on the PyTorch tensors.)
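A minimal sketch of that wrapping, reusing the sum_idxs kernel defined above (the tensor shapes, dtypes, and variable names here are just illustrative):

import torch
from numba import cuda

# GPU-resident PyTorch tensors; cuda.as_cuda_array shares their memory with
# Numba, so no host/device copies are made
t = torch.ones((3, 10), dtype=torch.float32, device='cuda')
idxs_t = torch.tensor([0, 2, 3, 10], dtype=torch.int32, device='cuda')
out_t = torch.zeros((3, len(idxs_t) - 1), dtype=torch.float32, device='cuda')

threads_per_block = 32
blocks_per_grid = (t.shape[1] + threads_per_block - 1) // threads_per_block
sum_idxs[blocks_per_grid, threads_per_block](
    cuda.as_cuda_array(t), cuda.as_cuda_array(idxs_t), cuda.as_cuda_array(out_t))
print(out_t)   # the result tensor never leaves the GPU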
However, this is still a major performance bottleneck in my code. Is there any way to optimize it further?
Here is one possible approach. A segmented reduction can usually be implemented fairly efficiently with one block per segment (or, in this case, one block per row). If the number of segments/rows is large enough, this tends to saturate the GPU.
The code design I will suggest uses one block per row, and each block processes the 3 segments of that row in sequence. To handle a segment, the block gathers the data with a block-stride loop and then performs a standard shared-memory CUDA reduction.
Here is an example, fixing a few things in your code that were mentioned in the comments (correct grid dimensions, conversion to float32):
$ cat t73.py
from numba import cuda,float32,int32
import numpy as np
import math

#TPB = threads per block, max of 1024
#TPB must be the power-of-2 expressed in TPBP2, i.e. TPB = 2**TPBP2
TPB = 1024
H = TPB//2
TPBP2 = 10

@cuda.jit
def sum_idxs(arr, idxs, sum_arr):
    pos = cuda.grid(1)
    if pos>=arr.shape[1]: return
    for i in range(len(idxs)):
        if idxs[i]<=pos<idxs[i+1]:
            thread_idx = i
            break
    for i in range(arr.shape[0]):
        cuda.atomic.add(sum_arr, (i, thread_idx), arr[i, pos])

@cuda.jit
def sum_idxs_i(arr, idxs, sum_arr):
    s = cuda.shared.array(shape=(TPB), dtype=float32)
    tx = cuda.threadIdx.x
    row = cuda.blockIdx.x
    #process each of the 3 segments in a row
    for j in range(3):
        lower = idxs[j]
        upper = idxs[j+1]
        val = float32(0)
        #block-stride loop to gather data from the segment
        for i in range(tx+lower, upper, TPB):
            val += arr[row, i]
        #canonical shared-memory parallel reduction
        s[tx] = val
        mid = H
        for i in range(TPBP2):
            cuda.syncthreads()
            if tx < mid:
                s[tx] += s[tx+mid]
            mid >>= 1
        if tx == 0:
            sum_arr[row, j] = s[0]

rows = 1000
cols = 100000
arr = np.ones(shape=(rows, cols),dtype=np.float32)
idxs = np.array([0, 2, 3, cols],dtype=np.int32)
sum_arr = np.zeros(shape=(arr.shape[0], len(idxs)-1),dtype=np.float32)
blocks_per_grid = math.ceil(arr.shape[1] / TPB)
sum_idxs[blocks_per_grid, TPB](arr, idxs, sum_arr)
print(sum_arr)

sum_arr = np.zeros(shape=(arr.shape[0], len(idxs)-1),dtype=np.float32)
blocks_per_grid = (arr.shape[0])
sum_idxs_i[blocks_per_grid, TPB](arr, idxs, sum_arr)
print(sum_arr)
$ nvprof python t73.py
==4383== NVPROF is profiling process 4383, command: python t73.py
[[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
...
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]]
[[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
...
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]
[2.0000e+00 1.0000e+00 9.9997e+04]]
==4383== Profiling application: python t73.py
==4383== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 45.92% 287.93ms 6 47.988ms 1.1520us 144.09ms [CUDA memcpy HtoD]
44.88% 281.42ms 6 46.903ms 1.4720us 140.74ms [CUDA memcpy DtoH]
8.46% 53.052ms 1 53.052ms 53.052ms 53.052ms cudapy::__main__::sum_idxs$241(Array<float, int=2, C, mutable, aligned>, Array<int, int=1, C, mutable, aligned>, Array<float, int=2, C, mutable, aligned>)
0.75% 4.6729ms 1 4.6729ms 4.6729ms 4.6729ms cudapy::__main__::sum_idxs_i$242(Array<float, int=2, C, mutable, aligned>, Array<int, int=1, C, mutable, aligned>, Array<double, int=2, C, mutable, aligned>)
API calls: 43.26% 339.61ms 6 56.602ms 20.831us 193.89ms cuMemcpyDtoH
36.75% 288.52ms 6 48.087ms 15.434us 144.35ms cuMemcpyHtoD
18.66% 146.51ms 1 146.51ms 146.51ms 146.51ms cuDevicePrimaryCtxRetain
0.93% 7.3083ms 5 1.4617ms 4.8120us 6.7314ms cuMemFree
0.23% 1.8049ms 6 300.81us 9.4520us 778.85us cuMemAlloc
0.04% 327.52us 2 163.76us 156.34us 171.19us cuLinkAddData
0.04% 299.72us 2 149.86us 148.92us 150.80us cuModuleLoadDataEx
0.04% 276.32us 2 138.16us 131.16us 145.16us cuLinkComplete
0.02% 123.96us 2 61.978us 61.252us 62.704us cuLinkCreate
0.01% 64.406us 2 32.203us 29.439us 34.967us cuLaunchKernel
0.01% 63.184us 2 31.592us 30.251us 32.933us cuDeviceGetName
0.00% 29.454us 1 29.454us 29.454us 29.454us cuMemGetInfo
0.00% 20.732us 26 797ns 477ns 2.0320us cuCtxGetCurrent
0.00% 12.852us 25 514ns 363ns 1.0920us cuCtxGetDevice
0.00% 12.429us 2 6.2140us 1.7830us 10.646us cuDeviceGetPCIBusId
0.00% 5.0950us 10 509ns 302ns 1.0770us cuFuncGetAttribute
0.00% 3.9600us 2 1.9800us 1.8000us 2.1600us cuModuleGetFunction
0.00% 3.5630us 2 1.7810us 1.7510us 1.8120us cuLinkDestroy
0.00% 1.8970us 1 1.8970us 1.8970us 1.8970us cuCtxPushCurrent
0.00% 1.8370us 4 459ns 226ns 697ns cuDeviceGet
0.00% 1.6080us 6 268ns 181ns 481ns cuDeviceGetAttribute
0.00% 1.5060us 3 502ns 230ns 795ns cuDeviceGetCount
0.00% 1.2390us 2 619ns 428ns 811ns cuDeviceComputeCapability
$
This code was run on a GTX 960, which happens to report a device memory bandwidth of ~84 GB/s via the bandwidthTest CUDA sample code. In the example above we see that the improved kernel runs in ~4.7 ms (roughly 10x faster than the original atomic kernel), which works out to (1000*100000*4) bytes / 4.7 ms ~= 85 GB/s, so we can conclude that, for this particular test case, this kernel is approximately "optimal".
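As a quick sanity check on that arithmetic (the numbers below are taken from the nvprof output above and are specific to this GTX 960 run):

rows, cols, bytes_per_elem = 1000, 100000, 4   # float32 input array
kernel_time_s = 4.6729e-3                      # sum_idxs_i time from the profile
bytes_read = rows * cols * bytes_per_elem      # each input element is read once
print(bytes_read / kernel_time_s / 1e9)        # ~85.6 GB/s effective bandwidth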