Why does increasing the number of CUDA blocks increase the run time?

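My understanding was that in CUDA, increasing the number of blocks does not increase the run time. In my code, however, if I double the number of blocks, the time roughly doubles as well.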

#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#define num_of_blocks 500
#define num_of_threads 512
__constant__ double y = 1.1;
// set seed for random number generator
__global__ void initcuRand(curandState* globalState, unsigned long seed){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, idx, 0, &globalState[idx]);
}
// kernel function for SIR
__global__ void test(curandState* globalState, double *dev_data){
    // global threads id
    int idx     = threadIdx.x + blockIdx.x * blockDim.x;
    // local threads id
    int lidx    = threadIdx.x;
    // create shared memory to store the per-thread RNG states
    __shared__ curandState localState[num_of_threads];
    // shared memory to store samples
    __shared__ double sample[num_of_threads];
    // copy global seed to local
    localState[lidx]    = globalState[idx];
    __syncthreads();
    sample[lidx]    =  y + curand_normal_double(&localState[lidx]);
    if(lidx == 0){
        // save the first sample to dev_data;
        dev_data[blockIdx.x] = sample[0];
    }
    globalState[idx]    = localState[lidx];
}
int main(){
    // create the random number generator states
    curandState *globalState;
    cudaMalloc((void**)&globalState, num_of_blocks*num_of_threads*sizeof(curandState));
    initcuRand<<<num_of_blocks, num_of_threads>>>(globalState, 1);
    double *dev_data;
    cudaMalloc((double**)&dev_data, num_of_blocks*sizeof(double));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    // Start record
    cudaEventRecord(start, 0);
    test<<<num_of_blocks, num_of_threads>>>(globalState, dev_data);
    // Stop event
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop); // that's our time!
    // Clean up:
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    std::cout << "Time ellapsed: " << elapsedTime << std::endl;
    cudaFree(dev_data);
    cudaFree(globalState);
    return 0;
}

The test results are:

number of blocks: 500, Time ellapsed: 0.39136.
number of blocks: 1000, Time ellapsed: 0.618656.

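So what is the reason the time increases? Is it because I access constant memory, or because I copy data from shared memory to global memory? Is there any way to optimize this?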

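While the number of blocks that can run in parallel can be large, it is still finite because on-chip resources are limited. If a kernel launch requests more blocks than that limit, the additional blocks have to wait for earlier blocks to finish and free up their resources.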

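One limited resource is shared memory, of which your kernel uses 28 kilobytes. CUDA 8.0 compatible Nvidia GPUs offer between 48 and 112 kilobytes of shared memory per streaming multiprocessor (SM), so the maximum number of blocks running at any one time is between 1× and 3× the number of SMs on your GPU.

Other limited resources are registers and various per-warp resources in the scheduler. The CUDA occupancy calculator is a convenient Excel spreadsheet (it also works with OpenOffice/LibreOffice) that shows how these resources limit the number of blocks per SM for a specific kernel. Compile the kernel with the option --ptxas-options="-v" added to the nvcc command line, locate the line saying "ptxas info : Used XX registers, YY bytes smem, zz bytes cmem[0], ww bytes cmem[2]", and enter XX, YY, the number of threads per block you are trying to launch, and the compute capability of your GPU into the spreadsheet. It will then show the maximum number of blocks that can run in parallel on one SM.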
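The blocks-per-SM limit can also be queried at run time instead of read off a spreadsheet. The following is only a sketch of that idea: it uses the runtime call cudaOccupancyMaxActiveBlocksPerMultiprocessor, which is not mentioned above, it reproduces the kernel from the question so that the query sees the same register and shared memory usage, and it assumes device 0 is the GPU of interest.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

#define num_of_threads 512

__constant__ double y = 1.1;

// Kernel from the question, reproduced unchanged in behavior so the
// occupancy query reflects its real resource usage.
__global__ void test(curandState* globalState, double* dev_data) {
    int idx  = threadIdx.x + blockIdx.x * blockDim.x;
    int lidx = threadIdx.x;
    __shared__ curandState localState[num_of_threads];
    __shared__ double sample[num_of_threads];
    localState[lidx] = globalState[idx];
    __syncthreads();
    sample[lidx] = y + curand_normal_double(&localState[lidx]);
    if (lidx == 0) {
        dev_data[blockIdx.x] = sample[0];
    }
    globalState[idx] = localState[lidx];
}

int main() {
    // Ask the runtime how many blocks of this kernel fit on one SM.
    // The last argument is the dynamic shared memory size (none here;
    // the kernel's static shared memory is accounted for automatically).
    int blocksPerSM = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, test, num_of_threads, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Assumption: device 0 is the GPU the kernel will run on.
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "device query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("blocks per SM: %d\n", blocksPerSM);
    printf("SMs: %d, blocks resident at once: %d\n",
           prop.multiProcessorCount, blocksPerSM * prop.multiProcessorCount);
    return 0;
}
```

Any launch with more blocks than the last number printed will have some blocks waiting for earlier ones to finish.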

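You don't mention which GPU you have been testing on, so I'll use a GTX 980 as an example. It has 16 SMs with 96 KB of shared memory each, so at most 16 × 3 = 48 blocks can run in parallel. If you didn't use shared memory, the maximum number of resident warps would limit the number of blocks per SM to 4, allowing 64 blocks to run in parallel.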
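The arithmetic behind these GTX 980 figures can be sketched in a few lines of host-side C++. The helper names are hypothetical. The 48-byte size used for curandState in the usage note is an assumption, chosen because it reproduces the 28 KB figure above, and the limit of 64 resident warps per SM is the value for that GPU generation.

```cpp
#include <cstddef>

// Hypothetical helper: static shared memory used by one block of the kernel,
// i.e. one RNG state plus one double sample per thread.
inline std::size_t sharedMemPerBlock(int threadsPerBlock, std::size_t stateBytes) {
    return static_cast<std::size_t>(threadsPerBlock) * (stateBytes + sizeof(double));
}

// Blocks per SM permitted by shared memory alone (integer division rounds down).
inline int blocksBySharedMem(std::size_t smemPerSM, std::size_t smemPerBlock) {
    return static_cast<int>(smemPerSM / smemPerBlock);
}

// Blocks per SM permitted by the resident-warp limit alone.
inline int blocksByWarps(int maxWarpsPerSM, int threadsPerBlock, int warpSize = 32) {
    int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;
    return maxWarpsPerSM / warpsPerBlock;
}
```

With 512 threads per block, sharedMemPerBlock(512, 48) is 28672 bytes (28 KB), blocksBySharedMem(96 * 1024, 28672) is 3, and blocksByWarps(64, 512) is 4. Shared memory is therefore the tighter limit, and 16 SMs hold 16 × 3 = 48 blocks at once.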

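On any currently existing Nvidia GPU, your example requires at least about a dozen rounds of blocks to execute one after another, which explains why doubling the number of blocks also roughly doubles the run time.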
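To make the "rounds of blocks" idea concrete, here is a small host-side C++ sketch. The function name is hypothetical, and the figure of 48 concurrently running blocks is taken from the GTX 980 example rather than measured on the asker's GPU.

```cpp
// Hypothetical helper: number of sequential rounds ("waves") needed when at
// most concurrentBlocks blocks can be resident on the GPU at the same time.
// This is a ceiling division, since a partially filled last wave still
// has to run.
inline int wavesNeeded(int totalBlocks, int concurrentBlocks) {
    return (totalBlocks + concurrentBlocks - 1) / concurrentBlocks;
}
```

wavesNeeded(500, 48) gives 11 and wavesNeeded(1000, 48) gives 21, so going from 500 to 1000 blocks nearly doubles the number of waves that must run back to back.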
