Fast rasterization using OpenCL

I am writing a rasterizer for real-time 3D rendering using OpenCL.

My current architecture:

Vertex shader: 1 thread per vertex
Rasterizer: 1 thread per face, looping over all pixels covered by the face
Fragment shader: 1 thread per pixel

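This works well when faces take up little screen space, but when a single face covers most of the screen the frame rate drops, because the rasterizer thread for that face has to iterate sequentially over every pixel the face covers.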

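I think this could be solved with a tiled approach. The screen would be divided into sub-regions (tiles), and one thread would be launched per tile. Each tile would only process the faces whose bounding box overlaps it.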
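To make the overlap test concrete, here is a minimal CPU-side sketch of how a face's screen-space bounding box could be mapped to the range of tiles it touches. The names `Vec2`, `TileRange`, `tilesForTriangle` and the `tileSize` parameter are assumptions made up for illustration; they are not part of my current code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical types for illustration only.
struct Vec2 { float x, y; };
struct TileRange { int x0, y0, x1, y1; bool empty; };  // inclusive tile indices

// Returns the inclusive range of tiles touched by the screen-space bounding
// box of triangle (a, b, c), clamped to a width x height screen.
TileRange tilesForTriangle(Vec2 a, Vec2 b, Vec2 c,
                           int width, int height, int tileSize) {
    assert(width > 0 && height > 0 && tileSize > 0);

    const float minX = std::min({a.x, b.x, c.x});
    const float maxX = std::max({a.x, b.x, c.x});
    const float minY = std::min({a.y, b.y, c.y});
    const float maxY = std::max({a.y, b.y, c.y});

    // Bounding box entirely off-screen: the face touches no tile.
    if (maxX < 0.f || maxY < 0.f || minX > width - 1 || minY > height - 1)
        return {0, 0, -1, -1, true};

    // Clamp the pixel bounding box to the screen.
    const int px0 = static_cast<int>(std::max(0.f, std::floor(minX)));
    const int py0 = static_cast<int>(std::max(0.f, std::floor(minY)));
    const int px1 = static_cast<int>(std::min(float(width - 1), std::floor(maxX)));
    const int py1 = static_cast<int>(std::min(float(height - 1), std::floor(maxY)));

    // Integer division maps a pixel coordinate to its tile index.
    return {px0 / tileSize, py0 / tileSize, px1 / tileSize, py1 / tileSize, false};
}
```

With this, the per-tile work for any one face is bounded by the tile area rather than by the face's full screen coverage, which is the point of the tiled approach.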

I have some questions about this approach, though:

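Should finding the faces that overlap each tile be done on the CPU or on the GPU?
What data structure should be used to store the per-tile face lists? They will have variable length, but I believe OpenCL buffers are fixed-length.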
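One layout that might fit fixed-length buffers is to pack all per-tile lists back to back into a single index array and keep a second array of per-tile start offsets, built in two passes (count, then fill) with a prefix sum in between. The sketch below shows the idea on the CPU; `TileRange`, `TileBins` and `binFaces` are hypothetical names, and the input is assumed to be one inclusive tile range per face. On the GPU the count and fill passes would presumably need atomic increments, which this sketch does not show.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical input: inclusive range of tiles covered by one face's bounding
// box. A face that covers no tile has x1 < x0 or y1 < y0.
struct TileRange { int x0, y0, x1, y1; };

// Variable-length lists stored in two flat arrays:
// the faces of tile t are faceIndices[offsets[t] .. offsets[t + 1] - 1].
struct TileBins {
    std::vector<int> offsets;      // size numTiles + 1
    std::vector<int> faceIndices;  // size = total number of (tile, face) pairs
};

TileBins binFaces(const std::vector<TileRange>& faceTiles, int tilesX, int tilesY) {
    assert(tilesX > 0 && tilesY > 0);
    const int numTiles = tilesX * tilesY;

    TileBins bins;
    bins.offsets.assign(numTiles + 1, 0);

    // Pass 1: count faces per tile, stored one slot to the right so that the
    // prefix sum below turns counts into start offsets in place.
    for (const TileRange& r : faceTiles)
        for (int ty = r.y0; ty <= r.y1; ++ty)
            for (int tx = r.x0; tx <= r.x1; ++tx)
                ++bins.offsets[ty * tilesX + tx + 1];

    // Prefix sum: offsets[t] becomes the start of tile t's list.
    for (int t = 0; t < numTiles; ++t)
        bins.offsets[t + 1] += bins.offsets[t];

    // Pass 2: write each face index into every tile list it belongs to.
    bins.faceIndices.resize(bins.offsets[numTiles]);
    std::vector<int> cursor(bins.offsets.begin(), bins.offsets.end() - 1);
    for (std::size_t f = 0; f < faceTiles.size(); ++f) {
        const TileRange& r = faceTiles[f];
        for (int ty = r.y0; ty <= r.y1; ++ty)
            for (int tx = r.x0; tx <= r.x1; ++tx)
                bins.faceIndices[cursor[ty * tilesX + tx]++] = static_cast<int>(f);
    }
    return bins;
}
```

Both arrays have a size that is known before the fill pass, so each could be uploaded as an ordinary fixed-length buffer; only the total length changes from frame to frame.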

Sample host code for the current implementation:

// set up vertex shader args
queue.enqueueNDRangeKernel(vertexShader, cl::NullRange, numVerts, cl::NullRange);
// set up rasterizer args
queue.enqueueNDRangeKernel(rasterizer, cl::NullRange, numFaces, cl::NullRange);
// set up fragment shader args
queue.enqueueNDRangeKernel(fragmentShader, cl::NullRange, numPixels, cl::NullRange);
// read frame buffer to draw to screen
queue.enqueueReadBuffer(buffer_screen, CL_TRUE, 0, width * height * 3 * sizeof(unsigned char), screen);

Sample rasterizer kernel:

float2 bboxmin = (float2)(INFINITY,INFINITY);
float2 bboxmax = (float2)(-INFINITY,-INFINITY);
float2 clampCoords = (float2)(width-1, height-1);
// get bounding box
for (int i=0; i<3; i++) {
    for (int j=0; j<2; j++) {
        bboxmin[j] = max(0.f, min(bboxmin[j], vs[i][j]));
        bboxmax[j] = min(clampCoords[j], max(bboxmax[j], vs[i][j]));
    }
}
// loop over all pixels in bounding box
// this is the part that needs to be improved
int2 pix;
for (pix.x=bboxmin.x; pix.x<=bboxmax.x; pix.x++) {
    for (pix.y=bboxmin.y; pix.y<=bboxmax.y; pix.y++) {
        float3 bc_screen  = barycentric(vs[0].xy, vs[1].xy, vs[2].xy, (float2)(pix.x,pix.y), offset);
        float3 bc_clip    = (float3)(bc_screen.x/vsVP[0][3], bc_screen.y/vsVP[1][3], bc_screen.z/vsVP[2][3]);
        bc_clip = bc_clip/(bc_clip.x+bc_clip.y+bc_clip.z);
        float frag_depth = dot(homoZs, bc_clip);
        int pixInd = pix.x+pix.y*width;
        if (bc_screen.x<0 || bc_screen.y<0 || bc_screen.z<0 || zbuffer[pixInd]>frag_depth) continue;
        zbuffer[pixInd] = frag_depth;
    }
}

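A workaround is to abort rasterization and return early when a face is too large. This causes some visual artifacts, but at least the frame rate does not suffer.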
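As a rough illustration of that early-out, the check could compare the pixel count of the clamped bounding box against a budget. The helper name `exceedsPixelBudget` and the `maxPixels` threshold are assumptions for this sketch, not something from my kernel.

```cpp
#include <cassert>

// Hypothetical helper: returns true when the (already clamped, inclusive)
// pixel bounding box covers more pixels than the given budget, in which case
// the rasterizer thread would skip the face.
bool exceedsPixelBudget(int minX, int minY, int maxX, int maxY, long long maxPixels) {
    assert(maxPixels >= 0);
    if (maxX < minX || maxY < minY) return false;  // empty box, nothing to skip
    const long long w = static_cast<long long>(maxX) - minX + 1;
    const long long h = static_cast<long long>(maxY) - minY + 1;
    return w * h > maxPixels;
}
```

The budget trades image correctness for a bounded worst-case loop length, so it would have to be tuned per scene.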
