我正在为一个库编写一个函数,该库接受由2次方元素组成的大型数组(在GPU内存中)。此函数必须对不连续的子数组(长度相等,也是2的幂)求和,以生成较小(或很少相等大小)的数组。这是此处描述的OpenMP函数的GPU版本。
例如,
in[8] = { ... }
out = { in[0]+in[2], in[1]+in[3], in[4]+in[6], in[5]+in[7] }
缩减子阵列的元素由用户给定的比特索引列表bitInds
来确定。它们通知如何将in
的元素的索引(想象为比特串)映射到out
的索引。
例如,如果
bits = { 0, 2 }
则in
索引0=00映射到out[0]
并且in
索引4=100映射至out[2]
。
比特的长度可以在1
到log2(length(in))
的范围内。CCD_ 10的长度为CCD_。这意味着out
可以和in
一样大,或者一半大,或者四分之一大,等等
为该功能准备设备内存,并在CUDA内核中执行结构良好的精简似乎很有挑战性,我不确定如何开始。由于in
非常大,因此线程在本地和顺序地访问in
是很重要的。如何有效地执行in
的可能不连续子阵列的pow(2,length(bits))
缩减?
我尝试了三种实现,并比较了它们的性能。
- A:根据@RobertCrovella的例子,在每个块的独占全局内存上使用原子,然后在块之间减少
- B:让每个线程直接原子地写入全局内存
- C:让线程写入块本地共享内存,并最终将它们原子化地减少为全局内存
性能和大小约束方面的最佳解决方案出人意料地是最简单的,B。我提出了伪代码,并按照其复杂性的顺序描述了这些解决方案的权衡。
B
这个方法让每个线程原子地写入全局内存,直接写入最终输出。除了在全局存储器中拟合out
和in
之外,它对它们的大小没有施加额外的约束。
__global__ void myfunc_kernel(double* in, inLen, double* out, ...) {
// each thread handles one element of 'in'
int inInd = blockIdx.x*blockDim.x + threadIdx.x;
if (inInd >= inLen)
return;
// determine 'outInd' from other args
int outInd = ...
// atomically add directly to out global memory (vs all threads)
atomicAdd(&out[outInd], in[inInd]);
}
void myfunc(double* in, int inLen, double* out, ...) {
// copy unspecified additional args into device memory (shared, if fits)
double d_in = ...
// create global memory for final output (initialised to zero)
double d_out = ...
cudaMemset(d_out, 0, ...);
// allocate one thread per element of 'in'
int numThreadsPerBlock = 128;
int numBlocks = ceil(inLen / (double) numThreadsPerBlock);
myfunc_kernel<<numBlocks, numThreadsPerBlock>>(d_in, inLen, d_out, ...);
// copy output to host, and clean up
cudaMemcpy(out, d_out, ...);
cudaFree(...);
}
C
该方法为每个块分配共享内存,通过原子将块中的线程本地减少到共享内存中。此后,共享内存在块之间通过原子减少到输出全局内存中。必须将本地out
适配到共享存储器中将outLen
约束为最多(例如)4096
(取决于计算能力)。
__global__ void myfunc_kernel(double* in, int inLen, double* out, ...) {
// each thread handles one element of 'in'
int inInd = blockIdx.x*blockDim.x + threadIdx.x;
if (inInd >= inLen)
return;
// each block has a local copy of 'out', in shared memory
extern __shared__ double outBlock[];
// efficiently initialise shared memory to zero
// (looped in case fewer threads than outLen)
for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
outBlock[i] = 0;
__syncthreads();
// determine 'outInd' from other args
int outInd = ...
// atomically contribute thread's 'in' value to block shared memory
// (vs other threads in block)
atomicAdd(&outBlock[outInd], in[inInd]);
__syncthreads();
// keep one (or fewer) thread(s) per output element alive in each block
if (threadIdx.x >= outLen)
return;
// remaining threads atomically reduced shared memory into global
// (looped in-case fewer threads than 'outLen')
for (int i=threadIdx.x; i<outLen; i+=blockDim.x)
// (vs all surviving threads)
atomicAdd(&out[i], outBlock[i]);
/* This final reduction may be possible in parallel/tiers,
* though documentation & examples about reducing shared memory
* between blocks is disappointingly scarce
*/
}
void myFunc(double* in, int inLen, double* out, int outLen, ...) {
// if 'out' cannot fit in each block's shared memory, error and quit
if (outLen > 4096)
...
// copy unspecified additional args into device memory
double d_in = ...
// create global memory for final output (initialised to zero)
double d_out = ...
cudaMemset(d_out, 0, ...);
// allocate one thread per element of 'in'
int numThreadsPerBlock = 128;
int numBlocks = ceil(inLen / (double) numThreadsPerBlock);
// allocate shared memory to fit 'out' in each block
size_t mem = outLen * sizeof(double);
myfunc_kernel<<numBlocks, numThreadsPerBlock, mem>>(d_in, inLen, d_out, ...);
// copy output to host, and clean up
cudaMemcpy(out, d_out, ...);
cudaFree(...);
}
请注意,所有还原最终都是原子进行的。这不应该是必要的,因为在共享内存最终确定之后,全局内存的缩减具有严格的结构。然而,我在网上找不到关于如何将共享内存阵列集体减少为全局内存的文档或演示。
A
此方法使用与此直方图示例相同的范例。它在全局内存中创建out
的#numBlocks
副本,以便每个块写入独占内存。块中的线程以原子方式写入其块的全局内存分区。此后,第二内核将这些分区缩减为最终的out
阵列。这显著地限制了outLen
小于deviceGlobalMem/numBlocks
。
__global__ void myfunc_populate_kernel(double* in, int inLen, double* outs, int outLen, ...) {
// each thread handles one element of 'in'
int inInd = blockIdx.x*blockDim.x + threadIdx.x;
if (inInd >= inLen)
return;
// determine 'outInd' from other args
int outInd = ...
// map to this block's exclusive subarray in global memory
int blockOutInd = blockIdx.x*outLen + outInd;
// atomically contribute thread's 'in' value to block's global memory
// (vs threads in same block)
atomicAdd(&outs[outInd], in[inInd]);
}
__global__ void myfunc_reduce_kernel(double* out, double* outs, int outLen, int numOuts) {
// each thread handles one element of 'out'
int outInd = blockIdx.x*blockDim.x + threadIdx.x;
if (outInd >= outLen)
return;
// and determines it by reducing previous per-block partitions
double total = 0;
for (int n=0; n<numOuts; n++)
total += outs[outInd + n*outLen];
out[outInd] = total;
}
void myfunc(double* in, int inLen, double* outs, int outLen, ...) {
int numThreadsPerBlock = 128;
int numPopBlocks = ceil(inLen / (double) numThreadsPerBlock);
// if #numPopBlocks copies of 'out' cannot fit in global memory, error and exit
if (outLen * numPopBlocks > ...)
...
// copy unspecified additional args into device memory (shared, if fits)
double d_in = ...
// create a working 'out' for each block, in global memory (init to zero)
size_t mem = outLen * numPopBlocks * sizeof(double);
double d_outs = ...
cudaMalloc(&d_outs, mem);
cudaMemset(d_out, 0, mem);
// populate each blocks partition of global memory 'd_outs'
myfunc_populate_kernel<<numPopBlocks, numThreadsPerBlock>>(d_in, inLen, d_outs, outLen, ...);
// create global memory for final output
double d_out = ...
// reduce each blocks partition of global memory into output 'd_out'
int numReduceBlocks = ceil(outLen / (double) numThreadsPerBlock);
myfunc_reduce_kernel<<numReduceBlocks, numThreadsPerBlock>>(d_out, d_outs, outLen, numPopBolocks);
//copy output to host, and clean up
cudaMemcpy(out, d_out, ...);
cudaFree(...);
}
比较
我用Quadro P6000(24GB,6.1 CC)、inLen = 2^5, 2^10, 2^15, 2^20, 2^25
和(所有值)outLen = m2^5, 2^10, 2^15
(outLen <= inLen
)测试了这些方法。下面是一个运行时表:
╔══════════╦═════════╦══════════╦══════════╦══════════╗
║ inLen ║ outLen ║ A ║ B ║ C ║
╠══════════╬═════════╬══════════╬══════════╬══════════╣
║ 32 ║ 32 ║ 0.000067 ║ 0.000071 ║ 0.000039 ║
║ 1024 ║ 32 ║ 0.000244 ║ 0.000044 ║ 0.000071 ║
║ 1024 ║ 1024 ║ 0.000265 ║ 0.000043 ║ 0.000064 ║
║ 32768 ║ 32 ║ 0.000290 ║ 0.000077 ║ 0.000080 ║
║ 32768 ║ 1024 ║ 0.000287 ║ 0.000059 ║ 0.000096 ║
║ 32768 ║ 32768 ║ 0.000762 ║ 0.000111 ║ N/A ║
║ 1048576 ║ 32 ║ 0.000984 ║ 0.000793 ║ 0.000254 ║
║ 1048576 ║ 1024 ║ 0.001381 ║ 0.000343 ║ 0.002302 ║
║ 1048576 ║ 32768 ║ 0.017547 ║ 0.000899 ║ N/A ║
║ 1048576 ║ 1048576 ║ N/A ║ 0.006092 ║ N/A ║
║ 33554432 ║ 32 ║ 0.021068 ║ 0.022684 ║ 0.008079 ║
║ 33554432 ║ 1024 ║ 0.032839 ║ 0.008190 ║ 0.014071 ║
║ 33554432 ║ 32768 ║ 0.013734 ║ 0.011915 ║ N/A ║
╚══════════╩═════════╩══════════╩══════════╩══════════╝
尽管限制较少,但方法B显然是赢家,如图所示:
╔══════════════╦══════╦═════╗
║ outLen/inLen ║ A/B ║ C/B ║
╠══════════════╬══════╬═════╣
║ 1 ║ 0.9 ║ 0.5 ║
║ 32 ║ 5.5 ║ 1.6 ║
║ 1 ║ 6.2 ║ 1.5 ║
║ 1024 ║ 3.8 ║ 1.0 ║
║ 32 ║ 4.9 ║ 1.6 ║
║ 1 ║ 6.9 ║ ║
║ 32768 ║ 1.2 ║ 0.3 ║
║ 1024 ║ 4.0 ║ 6.7 ║
║ 32 ║ 19.5 ║ ║
║ 1 ║ ║ ║
║ 1048576 ║ 0.9 ║ 0.4 ║
║ 32768 ║ 4.0 ║ 1.7 ║
║ 1024 ║ 1.2 ║ ║
╚══════════════╩══════╩═════╝
陷阱
很难对CUDA文档和在线示例的状态保持礼貌。因此,我不发表评论,但向困惑的读者提供以下陷阱警告:
- 注意,虽然共享内存在内核的第一次执行时似乎被初始化为零,但在随后的执行中不会被重新初始化。因此,您必须在内核中手动重新初始化(没有方便的CUDA函数)
atomicAdd
缺少double
类型的重载。该文档声称这仅适用于计算能力<60,但我的代码不会在没有手动过载的情况下使用我的62 CC GPU进行编译。此外,文档中#if __CUDA_ARCH__ < 600
的代码片段是错误的,因为较新的arch预编译器没有定义__CUDA_ARCH__
。在此处查看正确的宏- 请注意,您在线查看的流行的并行归约范式会假设所有数据都在全局内存中,或者在单个块的共享内存中。将共享内存简化为全局结构似乎很深奥
确定bitInd
为了让感兴趣的读者更好地理解out
访问模式,以下是如何从每个线程中确定bitInd
的,其中(在组合的所有块之间)有inLen
。注意,实际的索引类型是long long int
,因为inLen
和outLen
都可以和2^30 = 1073741824
一样大
__global__ void myfunc_kernel(
double* in, int inLen, double* out, int* bitInds, int numBits
) {
long long int inInd = blockIdx.x*blockDim.x + threadIdx.x;
if (inInd >= inLen)
return;
// determine the number encoded by bits {bitInds} of inInd
long long int outInd = 0;
for (int b=0; b<numBits; b++) {
int bit = (inInd >> bitInds[b]) & 1;
outInd += bit * (1LL << b);
}
...
}
对于bitInds
的一些可能输入,访问模式很简单。例如:
- 如果
bitInds = {0, 1, ..., n-1}
(从0
连续增加),则outInd = inInd % 2^n
,并且我们的内核(次优)执行in
的连续分区的减少 - 如果是
bitInds = {n, n+1, n+2, ..., log2(inLen)-1}
,那么是outInd = floor(inInd / 2^(log2(inLen) - n))
,或者简单地说是outInd = floor(inInd 2^n / inLen)
然而,如果bitInds
是索引{0, ..., log2(inLen)-1}
的任意有序子集,则必须根据如上所述的每个比特来确定outInd
。