Improving the speed of OpenCL image processing



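I think my kernel's execution time is too high. Its only job is to blend two images together using addition, subtraction, division, or multiplication.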

/* Every line of a multi-line macro body must end with a backslash. */
#define SETUP_KERNEL(name, operator)                                         \
__kernel void name(__read_only image2d_t image1,                             \
                   __read_only image2d_t image2,                             \
                   __write_only image2d_t output,                            \
                   const sampler_t sampler1,                                 \
                   const sampler_t sampler2,                                 \
                   float opacity)                                            \
{                                                                            \
    int2 xy = (int2) (get_global_id(0), get_global_id(1));                   \
    float2 normalizedCoords = convert_float2(xy) /                           \
        (float2) (get_image_width(output), get_image_height(output));        \
    float4 pixel1 = read_imagef(image1, sampler1, normalizedCoords);         \
    float4 pixel2 = read_imagef(image2, sampler2, normalizedCoords);         \
    write_imagef(output, xy, (pixel1 * opacity) operator pixel2);            \
}

SETUP_KERNEL(div, /)
SETUP_KERNEL(add, +)
SETUP_KERNEL(mult, *)
SETUP_KERNEL(sub, -)

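As you can see, I use a macro to quickly define the different kernels. (Would it be better to use functions instead?) Somehow the kernel manages to take 3 ms on a GTX 970. What can I do to improve the performance of this particular kernel? Should I split it into separate programs?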

Bilinear interpolation is 2x-3x slower than nearest neighbor. Are you sure you are not using nearest neighbor in OpenGL?

What it does behind the scenes (through the sampler) is:

R1 = ((x2 - x)/(x2 - x1))*Q11 + ((x - x1)/(x2 - x1))*Q21
R2 = ((x2 - x)/(x2 - x1))*Q12 + ((x - x1)/(x2 - x1))*Q22
After the two R values are calculated, the value of P can finally be calculated by a weighted average of R1 and R2.
P = ((y2 - y)/(y2 - y1))*R1 + ((y - y1)/(y2 - y1))*R2
The calculation will have to be repeated for the red, green, blue, and optionally the alpha component.

http://supercomputingblog.com/graphics/coding-binear-interpolation/
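To make the cost concrete, here is a minimal C sketch of the formulas quoted above for a single channel. The function name `bilinear` and its parameter order are my own choices for illustration; a GPU sampler does this in hardware, so this shows only the arithmetic, not how any driver implements it.

```c
/* Bilinear interpolation of one channel.
 * Q11 is the texel at (x1, y1), Q21 at (x2, y1),
 * Q12 at (x1, y2) and Q22 at (x2, y2).
 * The caller must ensure x1 != x2 and y1 != y2. */
static float bilinear(float q11, float q21, float q12, float q22,
                      float x1, float x2, float y1, float y2,
                      float x, float y)
{
    /* Horizontal and vertical weights of the "far" texels. */
    float wx = (x - x1) / (x2 - x1);
    float wy = (y - y1) / (y2 - y1);

    /* (x2 - x)/(x2 - x1) equals 1 - wx, so R1 and R2 become: */
    float r1 = (1.0f - wx) * q11 + wx * q21;
    float r2 = (1.0f - wx) * q12 + wx * q22;

    /* Weighted average of R1 and R2 along y gives P. */
    return (1.0f - wy) * r1 + wy * r2;
}
```

A nearest-neighbor lookup needs one texel fetch per channel and none of these multiplications, whereas this needs four fetches and three blends, repeated for each of the four channels.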


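Or it may simply be that NVIDIA implements a fast path for OpenGL and a complete path for OpenCL image access. On AMD, for example, image writes take the complete path, data accesses smaller than 32 bits take the complete path, and image reads take the fast path.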


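Another possibility: Z-order is better suited to computing differences over this kind of image data, and OpenCL's non-Z-order layout (doubtful, maybe not) is worse.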
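For readers unfamiliar with the term, Z-order (Morton order) stores texels so that the bits of x and y are interleaved in the memory index, which keeps 2D neighbors close together in memory. The sketch below only illustrates the indexing. The names `part1by1` and `morton2d` are hypothetical helpers, and it says nothing about how any particular driver actually lays out its images.

```c
#include <stdint.h>

/* Spreads the lower 16 bits of v so that a zero bit sits between
 * each pair of original bits (bit i moves to bit 2*i). */
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Z-order index of texel (x, y): x occupies the even bits,
 * y the odd bits. Valid for x, y < 65536. */
static uint32_t morton2d(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}
```

With this indexing the 2x2 block (0,0), (1,0), (0,1), (1,1) maps to the consecutive indices 0, 1, 2, 3, whereas in a row-major layout the second row would be a full image width away.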

Division is often expensive. I suggest moving the calculation of normalizedCoords to the host side.

On the host side:

float normalized_x[output_width]; // initialize with [0..output_width-1]/output_width
float normalized_y[output_height]; // initialize with [0..output_height-1]/output_height
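The initialization described in those comments could look roughly like the sketch below. `make_normalized_table` is a hypothetical helper name, and heap allocation is my own choice here because a stack array of image width may be too large on some platforms.

```c
#include <stdlib.h>

/* Returns a newly allocated table with table[i] = i / n for
 * i in [0, n), matching what the kernel computed per work-item.
 * Returns NULL if n is 0 or allocation fails. The caller frees it. */
static float *make_normalized_table(size_t n)
{
    if (n == 0)
        return NULL;

    float *table = malloc(n * sizeof *table);
    if (table == NULL)
        return NULL;

    for (size_t i = 0; i < n; ++i)
        table[i] = (float)i / (float)n;

    return table;
}
```

One table would be built for the output width and one for the height. Each can then be uploaded once, for instance with `clCreateBuffer` using `CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR`, and bound with `clSetKernelArg`. The tables only need rebuilding when the output size changes.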

Change the kernel to:

/* Every line of a multi-line macro body must end with a backslash. */
#define SETUP_KERNEL(name, operator)                                         \
__kernel void name(__read_only image2d_t image1,                             \
                   __read_only image2d_t image2,                             \
                   __write_only image2d_t output,                            \
                   global float *normalized_x,                               \
                   global float *normalized_y,                               \
                   const sampler_t sampler1,                                 \
                   const sampler_t sampler2,                                 \
                   float opacity)                                            \
{                                                                            \
    int2 xy = (int2) (get_global_id(0), get_global_id(1));                   \
    float2 normalizedCoords =                                                \
        (float2) (normalized_x[xy.x], normalized_y[xy.y]);                   \
    float4 pixel1 = read_imagef(image1, sampler1, normalizedCoords);         \
    float4 pixel2 = read_imagef(image2, sampler2, normalizedCoords);         \
    write_imagef(output, xy, (pixel1 * opacity) operator pixel2);            \
}

You can also try the same technique for the normalized coordinates. It pays off most when the size of the input images does not change often.
