我的内核为数据存储分配了一个共享内存,但如果我更改了共享内存的大小,就会报告错误,请参阅附件中的代码。
#include <stdio.h>
#include <assert.h>
#define cucheck_dev(call)
{
cudaError_t cucheck_err = (call);
if(cucheck_err != cudaSuccess) {
const char *err_str = cudaGetErrorString(cucheck_err);
printf("%s (%d): %sn", __FILE__, __LINE__, err_str);
assert(0);
}
}
__global__ void kernel(int datanum)
{
extern __shared__ int sh[];
// assign data for data 1
float2* data_ptr1((float2*)sh);
for (int thid = threadIdx.x; thid < datanum; thid += blockDim.x)
{
data_ptr1[thid] = make_float2(0., 0.);
}
__syncthreads();
// assign data for data 2
size_t shOffset = (sizeof(float2)/sizeof(int)*(datanum));
if(threadIdx.x == 0) printf("Offset: %dn", (int)(shOffset));
__syncthreads();
float4 *data_ptr2((float4*)&sh[shOffset]);
for (int thid = threadIdx.x; thid < datanum; thid += blockDim.x)
{
data_ptr2[thid] = make_float4(0., 0., 0., 0.);
}
__syncthreads();
}
int main()
{
int datanum = 21; // bug reports for datanum = 21, but everthing works fine for datanum = 20
int blocknum = 1;
int threadperblock = 128;
int preallocated = 768;
size_t shmem = datanum*sizeof(float2) + preallocated*sizeof(int);
printf("Allocated Shared memory byte: %d Nums: %dn", (int)shmem, (int)(shmem/sizeof(int)));
kernel<<<blocknum, threadperblock, shmem>>>(datanum);
cudaDeviceSynchronize();
cucheck_dev(cudaGetLastError());
}
- 操作系统:Ubuntu 18.02
- 库达:10.1
- 设备:RTX 2060
- g++:7.5.0
如图所示,共享内存包括两个区域,一个用于固定数据,类型为float2
。
其他区域可以将不同类型保存为int
或float4
,从共享存储器条目偏移。
当我将datanum
设置为20时,代码工作正常。
但当datanum
更改为21时,代码会报告一个未对齐的地址。
我非常感谢您的回复或建议。
谢谢!
cuda-memcheck
提供的一些信息发布在这里供参考:
========= Invalid __shared__ write of size 16
========= at 0x00000280 in kernel(int)
========= by thread (20,0,0) in block (0,0,0)
========= Address 0x000001e8 is misaligned
========= Device Frame:kernel(int) (kernel(int) : 0x280)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x34e) [0x2efabe]
========= Host Frame:test [0x13de9]
========= Host Frame:test [0x13e77]
========= Host Frame:test [0x4a1c5]
========= Host Frame:test [0x6f32]
========= Host Frame:test [0x6df5]
========= Host Frame:test [0x6e2e]
========= Host Frame:test [0x6c14]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
========= Host Frame:test [0x69fa]
=========
========= Invalid __shared__ write of size 16
========= at 0x00000280 in kernel(int)
========= by thread (19,0,0) in block (0,0,0)
========= Address 0x000001d8 is misaligned
========= Device Frame:kernel(int) (kernel(int) : 0x280)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x34e) [0x2efabe]
========= Host Frame:test [0x13de9]
========= Host Frame:test [0x13e77]
========= Host Frame:test [0x4a1c5]
========= Host Frame:test [0x6f32]
========= Host Frame:test [0x6df5]
========= Host Frame:test [0x6e2e]
========= Host Frame:test [0x6c14]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
========= Host Frame:test [0x69fa]
=========
========= Invalid __shared__ write of size 16
========= at 0x00000280 in kernel(int)
========= by thread (18,0,0) in block (0,0,0)
========= Address 0x000001c8 is misaligned
========= Device Frame:kernel(int) (kernel(int) : 0x280)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x34e) [0x2efabe]
========= Host Frame:test [0x13de9]
========= Host Frame:test [0x13e77]
========= Host Frame:test [0x4a1c5]
========= Host Frame:test [0x6f32]
========= Host Frame:test [0x6df5]
========= Host Frame:test [0x6e2e]
========= Host Frame:test [0x6c14]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
========= Host Frame:test [0x69fa]
=========
========= Invalid __shared__ write of size 16
========= at 0x00000280 in kernel(int)
========= by thread (17,0,0) in block (0,0,0)
========= Address 0x000001b8 is misaligned
========= Device Frame:kernel(int) (kernel(int) : 0x280)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x34e) [0x2efabe]
========= Host Frame:test [0x13de9]
========= Host Frame:test [0x13e77]
========= Host Frame:test [0x4a1c5]
========= Host Frame:test [0x6f32]
========= Host Frame:test [0x6df5]
========= Host Frame:test [0x6e2e]
========= Host Frame:test [0x6c14]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xe7) [0x21b97]
========= Host Frame:test [0x69fa]
您的问题是float4
的对齐度高于float2
的对齐度。因此线路
size_t shOffset = (sizeof(float2)/sizeof(int)*(datanum));
float4 *data_ptr2((float4*)&sh[shOffset]);
除非datanum是偶数,否则不要保证dataptr2正确对齐。
我在这里为这个问题写了一些代码:CUDA:文档中的共享内存对齐
最简单的解决方案是只交换data_ptr1
和data_ptr2
。将存储器的正面用于具有较大对齐的类型