CUDA 3D image coordinates

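I have a 3D image of size 512*512*512, and I have to process every voxel individually. However, I can't work out the right launch dimensions to get the x, y and z values that identify each voxel.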

In my kernel I have:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;

I launch the kernel with:

Kernel<<<dim3(8,8), dim3(8,8,16)>>>();

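I chose these because that gives 64 blocks with 1024 threads each, which I thought should give me every pixel. But with these dimensions, how do I get the coordinate values?

When I call the kernel I have to set the dimensions so that the x, y and z values actually run from 0 to 511 (which gives me the position of every pixel). But with every combination I try, the kernel either doesn't run, or it runs but the values don't go high enough.

The program should make it possible for each kernel thread to get an (x, y, z) coordinate that corresponds to a pixel in the image. As the simplest test, I just want to print the coordinates and see whether all of them come out.

Any help?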

Edit:

My GPU properties:

Compute capability: 2.0
Name: GeForce GTX 480

My program code, just to test it:

#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
// Device code
__global__ void Kernel()
{
    // Here I should somehow get the x, y and z values for every pixel possible in the 512*512*512 image
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    printf("Coords: (%i, %i, %i)\n", x, y, z);
}
// Host code
int main(int argc, char** argv) {
    Kernel<<<dim3(8, 8), dim3(8,8,16)>>>(); //This invokes the kernel
    cudaDeviceSynchronize();
    return 0;
}

To cover a 512x512x512 space with the indexing you've shown (i.e. one thread per voxel), you would need a kernel launch like this:

Kernel<<<dim3(64,64,64), dim3(8,8,8)>>>();

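Multiplying the grid and block components in any one dimension:

64*8

gives 512. That makes a grid with 512 threads in each of the 3 dimensions, and your indexing works unchanged with this arrangement, producing one unique thread per voxel.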
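As a quick sanity check of that arithmetic, here is a minimal host-side sketch in plain C++ (no CUDA needed). Extent3, coveredExtent and totalThreads are hypothetical names introduced for illustration; Extent3 only stands in for CUDA's dim3. It computes the index range that blockIdx * blockDim + threadIdx can reach in each dimension for a given launch configuration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for CUDA's dim3, so this builds without nvcc.
struct Extent3 {
    unsigned x, y, z;
};

// Number of distinct index values per dimension produced by
// blockIdx.* * blockDim.* + threadIdx.* for the given grid and block sizes.
Extent3 coveredExtent(Extent3 grid, Extent3 block) {
    assert(grid.x > 0 && grid.y > 0 && grid.z > 0);
    assert(block.x > 0 && block.y > 0 && block.z > 0);
    return Extent3{grid.x * block.x, grid.y * block.y, grid.z * block.z};
}

// Total number of threads in the launch (64-bit to avoid overflow).
std::uint64_t totalThreads(Extent3 grid, Extent3 block) {
    Extent3 e = coveredExtent(grid, block);
    return static_cast<std::uint64_t>(e.x) * e.y * e.z;
}
```

With the launch from the question, coveredExtent({8, 8, 1}, {8, 8, 16}) comes out as 64 x 64 x 16, i.e. 65,536 threads, which is why the values never got high enough. With grid (64, 64, 64) and block (8, 8, 8) it comes out as 512 x 512 x 512, i.e. 134,217,728 threads, one per voxel.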

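The dim3(64,64,64) launch assumes a cc2.0 or higher device, which allows 3D grids (your mention of 1024 threads per block suggests you have one). If you had a cc1.x device, you would need to modify your indexing.

In that case, you might want something like this: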

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = (blockIdx.y%64) * blockDim.y + threadIdx.y;
int z = (blockIdx.y/64) * blockDim.z + threadIdx.z;

along with a kernel launch like this:

Kernel<<<dim3(64,4096), dim3(8,8,8)>>>();
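To see that folding the z blocks into blockIdx.y still reaches every voxel exactly once, here is a minimal host-side sketch in plain C++ that replays the same arithmetic. Voxel, decodeFolded and coversEachVoxelOnce are hypothetical names introduced for illustration, and the sketch assumes cubic blocks (the same block size in x, y and z).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Voxel {
    int x, y, z;
};

// Same arithmetic as the 2D-grid indexing: blockIdx.y carries both the
// y block (remainder) and the z block (quotient). blocksPerDim is the number
// of blocks along one axis (64 for a 512-voxel axis with 8-thread blocks).
Voxel decodeFolded(int blockIdxX, int blockIdxY,
                   int threadX, int threadY, int threadZ,
                   int blockDim, int blocksPerDim) {
    assert(blockDim > 0 && blocksPerDim > 0);
    Voxel v;
    v.x = blockIdxX * blockDim + threadX;
    v.y = (blockIdxY % blocksPerDim) * blockDim + threadY;
    v.z = (blockIdxY / blocksPerDim) * blockDim + threadZ;
    return v;
}

// Walks every (block, thread) pair of a grid of size
// (blocksPerDim, blocksPerDim * blocksPerDim) and checks that each voxel of
// the n^3 volume, n = blockDim * blocksPerDim, is produced exactly once.
bool coversEachVoxelOnce(int blockDim, int blocksPerDim) {
    const int n = blockDim * blocksPerDim;
    std::vector<unsigned char> hits(static_cast<std::size_t>(n) * n * n, 0);
    for (int by = 0; by < blocksPerDim * blocksPerDim; ++by)
        for (int bx = 0; bx < blocksPerDim; ++bx)
            for (int tz = 0; tz < blockDim; ++tz)
                for (int ty = 0; ty < blockDim; ++ty)
                    for (int tx = 0; tx < blockDim; ++tx) {
                        Voxel v = decodeFolded(bx, by, tx, ty, tz,
                                               blockDim, blocksPerDim);
                        if (v.x < 0 || v.x >= n || v.y < 0 || v.y >= n ||
                            v.z < 0 || v.z >= n)
                            return false;  // index fell outside the volume
                        std::size_t idx =
                            (static_cast<std::size_t>(v.z) * n + v.y) * n + v.x;
                        if (hits[idx] != 0) return false;  // voxel hit twice
                        hits[idx] = 1;
                    }
    for (unsigned char h : hits)
        if (h != 1) return false;  // voxel never hit
    return true;
}
```

For the 512-voxel case, the last thread of the last block, decodeFolded(63, 4095, 7, 7, 7, 8, 64), maps to (511, 511, 511). coversEachVoxelOnce is meant for small sizes, such as coversEachVoxelOnce(8, 4) for a 32x32x32 volume, since it allocates one byte per voxel.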

Here is a complete example (cc2.0), based on the code you have now shown:

$ cat t604.cu
#include <stdio.h>
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)
// Device code
__global__ void Kernel()
{
    // Here I should somehow get the x, y and z values for every pixel possible in the 512*512*512 image
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if ((x==511)&&(y==511)&&(z==511)) printf("Coords: (%i, %i, %i)\n", x, y, z);
}
// Host code
int main(int argc, char** argv) {
    cudaFree(0);
    cudaCheckErrors("CUDA is not working correctly");
    Kernel<<<dim3(64, 64, 64), dim3(8,8,8)>>>(); //This invokes the kernel
    cudaDeviceSynchronize();
    cudaCheckErrors("kernel fail");
    return 0;
}
$ nvcc -arch=sm_20 -o t604 t604.cu
$ cuda-memcheck ./t604
========= CUDA-MEMCHECK
Coords: (511, 511, 511)
========= ERROR SUMMARY: 0 errors
$

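Note that I've chosen to print only one line. I didn't want to wade through 512x512x512 lines of output; it would take a long time to run, and the amount of in-kernel printf output is limited anyway.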
