CUDA 3D to linear index mapping (pitch)

I have three-dimensional data that I am processing with CUDA.

I allocate the memory with cudaMallocPitch():

cudaMallocPitch((void **)&test_data, &pitch, sizeof(float)*N*N, N);

In the two-dimensional version of my code, where the dimensions are N*N, I accessed a particular element like this:

i = blockIdx.x*BLOCK_X + threadIdx.x;
j = blockIdx.y*BLOCK_Y + threadIdx.y;
linearIdx = i + j*pitch/sizeof(float);

Now I want to extend the code to 3D. I can obtain a z index in the same way:

k = blockIdx.z*BLOCK_Z + threadIdx.z;

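But how do I now combine the three into a linear index? What exactly is the pitch? How do I access the element now? Please also comment on whether my approach to allocating memory for 3D data is correct.

Thanks!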

The Programming Guide has a couple of good examples for 2D and 3D arrays. They are as follows:

2D:

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

3D:

// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float),
                                    height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);
// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,
                         int width, int height, int depth)
{
    // cudaPitchedPtr::ptr is a void*, so C++ needs an explicit cast
    char* devPtr = (char*)devPitchedPtr.ptr;
    size_t pitch = devPitchedPtr.pitch;
    size_t slicePitch = pitch * height;
    for (int z = 0; z < depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < width; ++x) {
                float element = row[x];
            }
        }
    }
}

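With the 2D array it is easy to see how the pitch is used. The array pointer is cast to char* because the pitch is returned as a size in bytes, not as a number of elements (the pitch may not be a multiple of the element size).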
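The byte-based row addressing described above can be sketched in plain C++ that would also compile as CUDA host code. The helper names rowPtr and elementAt are hypothetical and not part of the CUDA API; this is only an illustration of stepping in bytes before reinterpreting the pointer as float*.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: pointer to the first element of row r in a
// pitched 2D allocation. pitchBytes is the row stride in BYTES.
inline float* rowPtr(void* base, std::size_t pitchBytes, std::size_t r) {
    assert(base != nullptr);
    // Advance in bytes first, then view the row as floats.
    char* bytes = static_cast<char*>(base) + r * pitchBytes;
    return reinterpret_cast<float*>(bytes);
}

// Hypothetical helper: reference to element (row r, column c).
inline float& elementAt(void* base, std::size_t pitchBytes,
                        std::size_t r, std::size_t c) {
    return rowPtr(base, pitchBytes, r)[c];
}
```

For instance, with a pitch of 32 bytes (8 floats per padded row), elementAt(buf, 32, 2, 4) refers to the float at offset 2*8 + 4 = 20 from the start of the buffer.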

For 3D arrays this is simply extended using the height of each 2D array. It is analogous to unrolling the 3D structure into a stack of 2D slices.

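Memory is a 1D contiguous space of bytes. 1D, 2D and 3D access patterns depend on how you interpret the data and on how you access it with 1D, 2D and 3D thread blocks.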

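"cudaMallocPitch allocates at least width (in bytes) * height bytes of linear memory on the device. The function may pad the allocation to ensure that corresponding pointers in any given row will continue to meet the alignment requirements for coalescing as the address is updated from row to row. The pitch returned in *pitch by cudaMallocPitch() is the width in bytes of the allocation." -- Memory Management [CUDA Runtime API].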
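To make the padding concrete, here is a small sketch of how a requested row width might be rounded up to an aligned pitch. The function paddedPitch and the alignment values used below are assumptions for illustration only: the real pitch is chosen by the CUDA runtime for the device in use and should always be taken from cudaMallocPitch() rather than computed by hand.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of row padding: round widthBytes up to the next
// multiple of `alignment`. The actual alignment is device-specific.
inline std::size_t paddedPitch(std::size_t widthBytes, std::size_t alignment) {
    assert(alignment > 0);
    return (widthBytes + alignment - 1) / alignment * alignment;
}

// Total bytes occupied by `height` rows once each row is padded.
inline std::size_t pitchedBytes(std::size_t pitchBytes, std::size_t height) {
    return pitchBytes * height;
}
```

Under an assumed 256-byte alignment, a row of 100 floats (400 bytes) would occupy 512 bytes, so 112 bytes at the end of every row would be padding that the kernel must skip over.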

For data of size M (rows) x N (cols) x K (slices), each pixel of one slice is located at

i = blockIdx.x*BLOCK_X + threadIdx.x;
j = blockIdx.y*BLOCK_Y + threadIdx.y;
linearIdx = i + j*pitch/sizeof(float);

The next slice of pixels is M x N positions away. So to access your data you have to jump correctly to the next slice, i.e.

i = blockIdx.x*BLOCK_X + threadIdx.x;
j = blockIdx.y*BLOCK_Y + threadIdx.y;
// index for the slice
k = blockIdx.z*BLOCK_Z + threadIdx.z;
// row stride in elements (pitch is in bytes)
int next_row_pitched = pitch/sizeof(float);
// N is the number of rows per slice (N x N x N data, as in the question)
linearIdx = i + j*next_row_pitched + k*next_row_pitched*N;

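To reach the third dimension you have to jump over all the pixels of one slice, i.e. M x N positions. Since each row is pitched (padded), the row length in that product has to be replaced by the pitch returned by cudaMallocPitch.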
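The index computation can be written as a small function, shown here as plain C++ so it can be checked on the host. The name linearIndex3D is hypothetical. The sketch assumes that every allocated row holds exactly one data row, i.e. an allocation of width N*sizeof(float) and height N*N for an N x N x N cube, and that the pitch is a multiple of sizeof(float). Note that the call in the question requests rows of N*N floats instead, which would not match this indexing.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: linear float index of element (i, j, k), where
// i = column (x), j = row (y), k = slice (z).
// pitchBytes: row stride in bytes; rowsPerSlice: rows in one 2D slice.
inline std::size_t linearIndex3D(std::size_t i, std::size_t j, std::size_t k,
                                 std::size_t pitchBytes,
                                 std::size_t rowsPerSlice) {
    // Assumption: the pitch divides evenly into float elements.
    assert(pitchBytes % sizeof(float) == 0);
    const std::size_t rowStride = pitchBytes / sizeof(float);  // elements per padded row
    const std::size_t sliceStride = rowStride * rowsPerSlice;  // elements per slice
    return i + j * rowStride + k * sliceStride;
}
```

With a 512-byte pitch (128 floats per row) and 64 rows per slice, element (i=3, j=2, k=1) maps to 3 + 2*128 + 1*128*64 = 8451.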

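The indexing above is valid if you launch 3D thread blocks. You could also launch 2D thread blocks and then iterate over the number of slices.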
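The second option, where each (i, j) thread loops over the slices itself, might look like the following host-side sketch. The function sumOverSlices is a hypothetical stand-in for the body of one thread; in a kernel, i and j would come from blockIdx and threadIdx as shown earlier, and the summation is only a placeholder for real per-element work.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-thread work for a 2D launch: visit position (i, j)
// in every slice k and accumulate the values found there.
inline float sumOverSlices(const float* data, std::size_t i, std::size_t j,
                           std::size_t pitchBytes, std::size_t rowsPerSlice,
                           std::size_t depth) {
    // Assumption: the pitch divides evenly into float elements.
    assert(pitchBytes % sizeof(float) == 0);
    const std::size_t rowStride = pitchBytes / sizeof(float);
    const std::size_t sliceStride = rowStride * rowsPerSlice;
    float sum = 0.0f;
    for (std::size_t k = 0; k < depth; ++k) {
        // Same linear index as in the 3D case, with k supplied by the loop.
        sum += data[i + j * rowStride + k * sliceStride];
    }
    return sum;
}
```

The trade-off is fewer threads that each do more work, which may or may not suit a given kernel.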

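Chapter 3.2.2 "Device Memory" of the CUDA C Programming Guide has a code example that allocates a width × height × depth 3D array of floating-point values and shows how to loop over the array elements in device code.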
