Averaging an image on CUDA with atomicAdd gives inconsistent results



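I am averaging the intensities of an image as a simple test case for a larger problem. However, I get slightly different results every time I run it. In contrast, if I run the same algorithm sequentially on the CPU, the result is static. Here is the GPU code: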

//util.cu
__global__ void avgImageDevice(float3 *avg, float3 *d_colorImageRGB, unsigned int width, unsigned int height)
{
    const unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    const unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    atomicAdd(&avg->x, d_colorImageRGB[y*width + x].x);
    atomicAdd(&avg->y, d_colorImageRGB[y*width + x].y);
    atomicAdd(&avg->z, d_colorImageRGB[y*width + x].z);
}
extern "C" void avgImage(float3 *avg, float3 *d_colorImageRGB, unsigned int width, unsigned int height)
{
    const int T_PER_BLOCK = 16;
    // T_PER_BLOCK x T_PER_BLOCK threads per block, and enough blocks to cover the whole image
    const dim3 blockSize(T_PER_BLOCK, T_PER_BLOCK);
    const dim3 gridSize((width + T_PER_BLOCK - 1) / T_PER_BLOCK, (height + T_PER_BLOCK - 1) / T_PER_BLOCK);
    avgImageDevice<<<gridSize, blockSize>>>(avg, d_colorImageRGB, width, height);
}

And the CPU implementation, together with the host code that calls the kernel, is as follows:

//main.cpp
#include <vector_types.h>
#include <opencv2/core/core.hpp>
#include <cuda_runtime.h> 
#include <string>
extern "C" void avgImage(float3 *avg, float3 *d_colorImageRGB, unsigned int width, unsigned int height);
int main()
{
    for(int k = 0 ; k < 100 ;++k)
    {
        //Initialization
        Mat Image;
        float3 avgCPU = make_float3(0, 0, 0);
        float3 avgGPU = make_float3(0, 0, 0);
        std::string filenameImage("/foo.jpg");
        Image = imread(filenameImage, -1);
        Image.convertTo(Image, CV_32FC3, 1.0f / 255);
        //Copy to GPU global memory
        cutilSafeCall(cudaMemcpy(d_albedoMapFilteredFloat3, Image.data, sizeof(float) * 3 * Image.size().width * Image.size().height, cudaMemcpyHostToDevice));
        //Average on CPU
        for (int x = 0; x < Image.size().width; ++x)
            for (int y = 0; y < Image.size().height; ++y)
            {
                Vec3f intensity = Image.at<Vec3f>(y, x);
                avgCPU += make_float3(intensity.val[0], intensity.val[1], intensity.val[2]);
            }
        avgCPU /= Image.size().width * Image.size().height;

        //Average on GPU
        float3 *d_avg;
        cutilSafeCall(cudaMalloc(&d_avg, sizeof(float3)));
        cutilSafeCall(cudaMemset(d_avg, 0, sizeof(float3)));
        avgImage(d_avg, d_albedoMapFilteredFloat3, Image.size().width, Image.size().height);
        cutilSafeCall(cudaMemcpy(&avgGPU, d_avg, sizeof(float3), cudaMemcpyDeviceToHost));
        avgGPU /= Image.size().width * Image.size().height;
        //Following values are constant across the iterations
        printf("AVG CPU r: %.10f, g: %.10f, b: %.10f\n", avgCPU.x, avgCPU.y, avgCPU.z);
        //Following values are different at every iteration
        printf("AVG GPU r: %.10f, g: %.10f, b: %.10f\n", avgGPU.x, avgGPU.y, avgGPU.z);
    }
}

So each consecutive pair of lines should match, and the values should be static. But they do not match, and the GPU results are not static.

AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325752139, g: 0.6762712002, b: 0.6835504174
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325753927, g: 0.6762660146, b: 0.6835544705
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325772405, g: 0.6762678027, b: 0.6835457087
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325744987, g: 0.6762621403, b: 0.6835452914
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325761080, g: 0.6762756109, b: 0.6835403442
AVG CPU r: 0.6326226592, g: 0.6762236953, b: 0.6836426258
AVG GPU r: 0.6325756311, g: 0.6762655973, b: 0.6835408211

I have a GTX 960, CUDA 6.5 and Windows 7. Is this a data race problem? As far as I know, atomicAdd has no reported issues with global memory.

It is not a data race.

Floating-point addition is commutative:

a + b == b + a

but it is not associative; there exist a, b, c such that:

(a + b) + c != a + (b + c)

A different ordering of the individual additions (specifically, how they are associated) will give different results.

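The result may depend on the order in which the threads are scheduled. Indeed, depending on the image size and the values in the components, the resulting average may differ slightly from run to run, even though all of the values are correct. If num differs from one run to another, there is quite likely a problem in some other part of the code. If num is the same, all of your results are IEEE-754 compliant.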
