重复使用的共享块内存的Cuda未对齐地址



我的内核为数据存储分配了一个共享内存,但如果我更改了共享内存的大小,就会报告错误,请参阅附件中的代码。

#include <stdio.h>
#include <assert.h>
#define cucheck_dev(call)                                   
{                                                           
cudaError_t cucheck_err = (call);                         
if(cucheck_err != cudaSuccess) {                          
const char *err_str = cudaGetErrorString(cucheck_err);  
printf("%s (%d): %sn", __FILE__, __LINE__, err_str);   
assert(0);                                              
}                                                         
}
__global__ void kernel(int datanum)
{
extern __shared__ int sh[];
// assign data for data 1
float2* data_ptr1((float2*)sh);
for (int thid = threadIdx.x; thid < datanum; thid += blockDim.x)
{
data_ptr1[thid] = make_float2(0., 0.);
}
__syncthreads();
// assign data for data 2

size_t shOffset = (sizeof(float2)/sizeof(int)*(datanum));
if(threadIdx.x == 0) printf("Offset: %dn", (int)(shOffset));
__syncthreads();
float4 *data_ptr2((float4*)&sh[shOffset]);
for (int thid = threadIdx.x; thid < datanum; thid += blockDim.x)
{
data_ptr2[thid] = make_float4(0., 0., 0., 0.);
}
__syncthreads();
}
int main()
{
int datanum = 21;     // bug reports for datanum = 21, but everthing works fine for datanum = 20
int blocknum = 1;
int threadperblock = 128;
int preallocated = 768;
size_t shmem = datanum*sizeof(float2) + preallocated*sizeof(int);
printf("Allocated Shared memory byte: %d  Nums: %dn", (int)shmem, (int)(shmem/sizeof(int)));
kernel<<<blocknum, threadperblock, shmem>>>(datanum);
cudaDeviceSynchronize();
cucheck_dev(cudaGetLastError());
}
  • 操作系统:Ubuntu 18.02
  • 库达:10.1
  • 设备:RTX 2060
  • g++:7.5.0

如图所示,共享内存包括两个区域,一个用于固定数据,类型为float2

其他区域可以将不同类型保存为intfloat4,从共享存储器条目偏移。

当我将datanum设置为20时,代码工作正常。

但当datanum更改为21时,代码会报告一个未对齐的地址。

我非常感谢您的回复或建议。

谢谢!

cuda-memcheck提供的一些信息发布在这里供参考:

========= Invalid __shared__ write of size 16
=========     at 0x00000280 in kernel(int)
=========     by thread (20,0,0) in block (0,0,0)
=========     Address 0x000001e8 is misaligned
=========     Device Frame:kernel(int) (kernel(int) : 0x280)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x34e) [0x2efabe]
=========     Host Frame:test [0x13de9]
=========     Host Frame:test [0x13e77]
=========     Host Frame:test [0x4a1c5]
=========     Host Frame:test [0x6f32]
=========     Host Frame:test [0x6df5]
=========     Host Frame:test [0x6e2e]
=========     Host Frame:test [0x6c14]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:test [0x69fa]
=========
========= Invalid __shared__ write of size 16
=========     at 0x00000280 in kernel(int)
=========     by thread (19,0,0) in block (0,0,0)
=========     Address 0x000001d8 is misaligned
=========     Device Frame:kernel(int) (kernel(int) : 0x280)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x34e) [0x2efabe]
=========     Host Frame:test [0x13de9]
=========     Host Frame:test [0x13e77]
=========     Host Frame:test [0x4a1c5]
=========     Host Frame:test [0x6f32]
=========     Host Frame:test [0x6df5]
=========     Host Frame:test [0x6e2e]
=========     Host Frame:test [0x6c14]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:test [0x69fa]
=========
========= Invalid __shared__ write of size 16
=========     at 0x00000280 in kernel(int)
=========     by thread (18,0,0) in block (0,0,0)
=========     Address 0x000001c8 is misaligned
=========     Device Frame:kernel(int) (kernel(int) : 0x280)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x34e) [0x2efabe]
=========     Host Frame:test [0x13de9]
=========     Host Frame:test [0x13e77]
=========     Host Frame:test [0x4a1c5]
=========     Host Frame:test [0x6f32]
=========     Host Frame:test [0x6df5]
=========     Host Frame:test [0x6e2e]
=========     Host Frame:test [0x6c14]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:test [0x69fa]
=========
========= Invalid __shared__ write of size 16
=========     at 0x00000280 in kernel(int)
=========     by thread (17,0,0) in block (0,0,0)
=========     Address 0x000001b8 is misaligned
=========     Device Frame:kernel(int) (kernel(int) : 0x280)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x34e) [0x2efabe]
=========     Host Frame:test [0x13de9]
=========     Host Frame:test [0x13e77]
=========     Host Frame:test [0x4a1c5]
=========     Host Frame:test [0x6f32]
=========     Host Frame:test [0x6df5]
=========     Host Frame:test [0x6e2e]
=========     Host Frame:test [0x6c14]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
=========     Host Frame:test [0x69fa]

您的问题是float4的对齐度高于float2的对齐度。因此线路

size_t shOffset = (sizeof(float2)/sizeof(int)*(datanum));
float4 *data_ptr2((float4*)&sh[shOffset]);

除非datanum是偶数,否则不要保证dataptr2正确对齐。

我在这里为这个问题写了一些代码:CUDA:文档中的共享内存对齐

最简单的解决方案是只交换data_ptr1data_ptr2。将存储器的正面用于具有较大对齐的类型

最新更新