CUDA block size of 1024 threads does not work (cc=20, sm=21)

My setup:
- CUDA Toolkit 5.5
- NVIDIA Nsight Eclipse Edition
- Ubuntu 12.04 x64
- CUDA device: NVIDIA GeForce GTX 560, cc=20, sm=21 (so, as you can see, I should be able to use blocks of up to 1024 threads)

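My display is rendered on the iGPU (Intel HD Graphics), so I can use the Nsight debugger.

However, I ran into some strange behavior when I set the number of threads per block above 960.

Code: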

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
// Kernel under test: a single float division
__global__ void mytest() {
    float a, b;
    b = 1.0F;
    a = b / 1.0F;
}
int main(void) {
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;
    // Launch one block of 961 threads (960 works, 961 fails)
    mytest<<<1, 961>>>();
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "error=%s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    // Reset the device and exit
    err = cudaDeviceReset();
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to deinitialize the device! error=%s\n",
                cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    printf("Done\n");
    return 0;
}

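And... it doesn't work. The problem is in the last line of the kernel, the one with the float division. Whenever I divide by a float, the code compiles but does not run. The error output at runtime is:

error=too many resources requested for launch

Here is what I get in the debugger when I step over the launch:

warning: Cuda API error detected: cudaLaunch returned (0x7)

Build output with -Xptxas -v: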

12:57:39 **** Incremental Build of configuration Debug for project block_size_test ****
make all 
Building file: ../src/vectorAdd.cu
Invoking: NVCC Compiler
/usr/local/cuda-5.5/bin/nvcc -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -G -g -O0 -m64 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -optf /home/vitrums/cuda-workspace/block_size_test/options.txt -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=sm_21 -odir "src" -M -o "src/vectorAdd.d" "../src/vectorAdd.cu"
/usr/local/cuda-5.5/bin/nvcc --compile -G -I"/usr/local/cuda-5.5/samples/0_Simple" -I"/usr/local/cuda-5.5/samples/common/inc" -O0 -g -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_21 -keep -keep-dir /home/vitrums/cuda-workspace-trashcan -m64 -optf /home/vitrums/cuda-workspace/block_size_test/options.txt  -x cu -o  "src/vectorAdd.o" "../src/vectorAdd.cu"
../src/vectorAdd.cu(7): warning: variable "a" was set but never used
../src/vectorAdd.cu(7): warning: variable "a" was set but never used
ptxas info    : 4 bytes gmem, 8 bytes cmem[14]
ptxas info    : Function properties for _ZN4dim3C1Ejjj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Compiling entry function '_Z6mytestv' for 'sm_21'
ptxas info    : Function properties for _Z6mytestv
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 34 registers, 8 bytes cumulative stack size, 32 bytes cmem[0]
ptxas info    : Function properties for _ZN4dim3C2Ejjj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
Finished building: ../src/vectorAdd.cu
Building target: block_size_test
Invoking: NVCC Linker
/usr/local/cuda-5.5/bin/nvcc --cudart static -m64 -link -o  "block_size_test"  ./src/vectorAdd.o   
Finished building target: block_size_test

12:57:41 Build Finished (took 1s.659ms)

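When I add the -keep option, the compiler generates a .cubin file, but I can't read it to find the smem and reg values as described in this topic: too-many-resources-requested-for-launch-how-to-find-out-what-resources-/. The file must be in a different format these days.

So I'm forced to use 256 threads per block, which is probably not a bad idea anyway, judging by this .xls: CUDA_Occupancy_calculator.

Anyway, any help would be appreciated.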

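I filled in the CUDA Occupancy Calculator file with the current information:

Compute capability: 2.1
Threads per block: 961
Registers per thread: 34
Shared memory: 0

I got 0% occupancy, limited by the register count.
With the thread count set to 960, occupancy is 63%, which explains why that configuration works.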
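
The calculator's verdict can also be checked from code. The following is a minimal sketch, not part of the original post: it asks the runtime how many registers the kernel was compiled to use (cudaFuncGetAttributes) and how many registers one block may use on the device (cudaGetDeviceProperties), then compares the two for a few block sizes. The helper regs_for_block is a hypothetical name, and it assumes registers are allocated per whole warp; any finer allocation-unit rounding is ignored.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Stand-in for the kernel from the question.
__global__ void mytest() {
    float a, b;
    b = 1.0F;
    a = b / 1.0F;
}

// Registers one block needs, ASSUMING registers are handed out per whole
// warp (thread count rounded up to a multiple of warp_size).
// Hypothetical helper; finer allocation-unit rounding is ignored.
static int regs_for_block(int threads, int regs_per_thread, int warp_size) {
    int warps = (threads + warp_size - 1) / warp_size;
    return warps * warp_size * regs_per_thread;
}

int main(void) {
    cudaFuncAttributes attr;
    cudaDeviceProp prop;
    const int candidates[] = {960, 961, 1024};

    // Query the compiled kernel and device 0.
    if (cudaFuncGetAttributes(&attr, mytest) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "query failed: %s\n",
                cudaGetErrorString(cudaGetLastError()));
        return 1;
    }

    printf("kernel uses %d registers per thread; device allows %d per block\n",
           attr.numRegs, prop.regsPerBlock);

    // Report which candidate block sizes fit into the register budget.
    for (int i = 0; i < 3; ++i) {
        int need = regs_for_block(candidates[i], attr.numRegs, prop.warpSize);
        printf("%4d threads -> %6d registers: %s\n", candidates[i], need,
               need <= prop.regsPerBlock ? "fits" : "too many");
    }
    return 0;
}
```

With the numbers from the question (34 registers per thread, and the 32768-register limit of compute capability 2.x), 961 threads round up to 31 warps, or 992 x 34 = 33728 registers, which exceeds the limit. 960 threads are exactly 30 warps, or 32640 registers, which fits. Thirty resident warps out of the 48 that a 2.x multiprocessor can hold is 62.5%, consistent with the 63% the calculator reports.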

Try limiting the register count to 32 and setting the thread count to 1024; that gives 67% occupancy.

To limit the register count, use the following option: nvcc [...] --maxrregcount=32
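
The --maxrregcount flag applies to every kernel in the compilation unit. As an alternative that the original answer does not mention, the __launch_bounds__ qualifier states the largest block size a single kernel will be launched with, and the compiler then limits that kernel's register use so such a block can fit. A minimal sketch, where the names mytest_bounded and MAX_THREADS_PER_BLOCK are made up for illustration:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Illustrative constant: the largest block size we intend to launch.
#define MAX_THREADS_PER_BLOCK 1024

// Same body as the kernel in the question; the qualifier tells the
// compiler to keep register use low enough for 1024-thread blocks.
__global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK) mytest_bounded() {
    float a, b;
    b = 1.0F;
    a = b / 1.0F;
}

int main(void) {
    cudaError_t err;

    // Launch one full-size block.
    mytest_bounded<<<1, MAX_THREADS_PER_BLOCK>>>();

    // Launch-configuration errors (such as "too many resources
    // requested for launch") are reported here.
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch error=%s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Wait for the kernel so that execution errors surface too.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "execution error=%s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    printf("Launched %d threads per block\n", MAX_THREADS_PER_BLOCK);
    return 0;
}
```

Rebuilding with -Xptxas -v should show whether the register count for this kernel dropped to 32 or fewer, which is what a 1024-thread block needs to stay within 32768 registers.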
