Overloaded CUDA kernel functions

I ran into a problem using overloaded kernel functions in CUDA.

I understand that CUDA can launch the right overload of a kernel based on the arguments I pass.

But what if I want to use cudaOccupancyMaxPotentialBlockSize() to compute the block size that gives maximum occupancy (see the documentation)?

__global__ void foo_cuda_kernel(int a)
{
    /* implementation 1 */
}

// overloaded kernel function
__global__ void foo_cuda_kernel(int a, int b)
{
    /* implementation 2 */
}

void foo_cuda(int thread_num)
{
    int min_grid_size, grid_size, block_size;
    cudaOccupancyMaxPotentialBlockSize(
        &min_grid_size, &block_size,
        foo_cuda_kernel, // how does it distinguish overloaded functions? (does not compile as written)
        0, thread_num);
    grid_size = (thread_num + block_size - 1) / block_size;

    // I understand the compiler can pick the launched overload from its arguments
    foo_cuda_kernel<<<grid_size, block_size>>>((int)1);
    cudaDeviceSynchronize();
}

How do I make this work? How does cudaOccupancyMaxPotentialBlockSize() tell overloaded functions apart?

As mentioned in the comments, you can cast the function to a pointer to the correct overload:

auto foo_ii = static_cast<void (*)(int, int)>(&foo_cuda_kernel);
auto foo_i = static_cast<void (*)(int)>(&foo_cuda_kernel);

Then pass foo_i or foo_ii to cudaOccupancyMaxPotentialBlockSize, depending on which version of the function you need.
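A minimal sketch of the question's code with the cast applied is below. The helper names `block_size_for_foo_i` and `block_size_for_foo_ii` are hypothetical, and returning -1 when the occupancy query fails (for example on a machine without a CUDA device) is a choice made for this illustration, not anything the runtime prescribes. The block-size limit is passed as 0 (no limit) here rather than `thread_num`.

```cpp
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void foo_cuda_kernel(int a) { /* implementation 1 */ }
__global__ void foo_cuda_kernel(int a, int b) { /* implementation 2 */ }

// Hypothetical helper: suggested block size for the one-argument overload,
// or -1 if the occupancy query fails (e.g. no CUDA device present).
int block_size_for_foo_i()
{
    // The cast's target type selects the overload; the result is an ordinary
    // function pointer, so the API's template parameter can be deduced.
    auto foo_i = static_cast<void (*)(int)>(&foo_cuda_kernel);
    int min_grid_size = 0, block_size = 0;
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &min_grid_size, &block_size, foo_i, 0, 0);
    return err == cudaSuccess ? block_size : -1;
}

// Hypothetical helper: same query for the two-argument overload.
int block_size_for_foo_ii()
{
    auto foo_ii = static_cast<void (*)(int, int)>(&foo_cuda_kernel);
    int min_grid_size = 0, block_size = 0;
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &min_grid_size, &block_size, foo_ii, 0, 0);
    return err == cudaSuccess ? block_size : -1;
}

void foo_cuda(int thread_num)
{
    int block_size = block_size_for_foo_i();
    if (block_size <= 0) {
        std::fprintf(stderr, "occupancy query failed\n");
        return;
    }
    int grid_size = (thread_num + block_size - 1) / block_size;
    assert(grid_size * block_size >= thread_num);  // enough threads launched

    // At the launch site the argument list already selects the overload.
    foo_cuda_kernel<<<grid_size, block_size>>>(1);
    cudaDeviceSynchronize();
}
```

The cast only has to appear where the kernel is passed as a value; the `<<<...>>>` launch itself still resolves the overload from its arguments, as the question expected.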

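This works because the toolchain silently emits host-side boilerplate functions that wrap the underlying runtime API calls used to launch the kernel and to enforce type checking of the kernel arguments. The host compiler treats these wrappers like any other host functions (because that is what they are), so the cast's target type selects the matching overload.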
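Because the wrappers are ordinary host functions, the situation can be reproduced in plain C++ without a GPU. In the sketch below, the overloads `foo` stand in for the generated wrappers, and the template `query(T func)` is a hypothetical stand-in for an API that, like cudaOccupancyMaxPotentialBlockSize, takes the kernel through a deduced template parameter `T`. Passing the bare overloaded name leaves `T` undeducible; casting first (or naming `T` explicitly) resolves it.

```cpp
#include <cassert>

// Two ordinary host overloads, standing in for the host-side wrappers
// generated for each __global__ overload.
void foo(int) {}
void foo(int, int) {}

// Reports how many parameters a function pointer's target takes.
template <class R, class... Args>
constexpr int param_count(R (*)(Args...)) { return sizeof...(Args); }

// Hypothetical stand-in for an API whose kernel parameter is a deduced `T func`.
template <class T>
int query(T func) { return param_count(func); }

void demo()
{
    // query(foo);  // would not compile: T cannot be deduced from an overload set

    // The cast's target type picks one overload and yields a plain pointer.
    auto foo_i  = static_cast<void (*)(int)>(&foo);
    auto foo_ii = static_cast<void (*)(int, int)>(&foo);
    assert(query(foo_i) == 1);
    assert(query(foo_ii) == 2);

    // In plain C++, spelling out T explicitly also resolves the overload.
    assert(query<void (*)(int, int)>(foo) == 2);
}
```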
