Optimizing an OpenCL Kernel


I'm trying to optimize the following kernel using OpenCL. Basically I'm doing triangle rasterization with a frame buffer of 800*800 pixels: I compute the bounding box of each triangle, and if the pixel currently being scanned lies inside that box, I rasterize the triangle.

Here is how I invoke it:

  global_size[0] = 800;
  global_size[1] = 800;
  

  auto time_start = std::chrono::high_resolution_clock::now();
  CL_CHECK(clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, global_size, NULL, 0, NULL, NULL));
  CL_CHECK(clFinish(commandQueue));

And the kernel:

size_t px = get_global_id(0); // triCount
size_t py = get_global_id(1); // triCount
int width = 800;
int height = 800;

float3 v0Raster = (float3)(triangles[px].v[0].pos[0], triangles[px].v[0].pos[1], triangles[px].v[0].pos[2]);
float3 v1Raster = (float3)(triangles[px].v[1].pos[0], triangles[px].v[1].pos[1], triangles[px].v[1].pos[2]);
float3 v2Raster = (float3)(triangles[px].v[2].pos[0], triangles[px].v[2].pos[1], triangles[px].v[2].pos[2]);
float xmin = min3(v0Raster.x, v1Raster.x, v2Raster.x);
float ymin = min3(v0Raster.y, v1Raster.y, v2Raster.y);
float xmax = max3(v0Raster.x, v1Raster.x, v2Raster.x);
float ymax = max3(v0Raster.y, v1Raster.y, v2Raster.y);
float slope = (ymax - ymin) / (xmax - xmin);
int dp, y;
bool discard_;
float ratio;
for (int x = round(xmin); x <= round(xmax); x++) {
    y = slope * (x - round(xmin) + ymin);
    ratio = (x - round(xmin) / (round(xmax) - round(xmin)));
    discard_ = false;

    int flatIdx = width - x + (height - y) * width;
    if (y < 0 || y > height || x < 0 || x > width) {
        discard_ = true;
    }
    if (!discard_) {


        fragments[flatIdx].col[0] = 1.0f;
        fragments[flatIdx].col[1] = 0;
        fragments[flatIdx].col[2] = 0;

    }
}

For a cube it takes one second to render, which is far too slow.

1 Answer

Answered by Quimby:

I see a couple of weird things in the kernel.

You are launching a thread for each pixel, but in each thread you pick a single triangle (confusingly indexed by a pixel coordinate) and rasterize the whole thing in the for-loop.

  • That results in many idle threads if there are fewer triangles than pixels, and if there are more triangles than pixels, some of them will never be processed. That is surely an error.
  • In each "pixel thread" you rasterize a whole triangle, which again looks like an error.
  • Access to the fragments buffer is therefore heavily contested: overlapping triangles cause many overwrites, and the memory access pattern is essentially random, which is the worst-case scenario.

Even if you implemented a per-triangle kernel correctly, it would not be a good idea: the runtime of each thread would correspond to the size of its triangle, so the work would be distributed unevenly, and the memory access pattern would still be terrible.

A much better and simpler solution IMHO is:

  1. Launch a thread per pixel/fragment.
  2. In each thread, go through all triangles in a for-loop, in the same order.
  3. Test whether the current pixel is inside the triangle; if so, compute its depth and color.
  4. At the end, color the pixel based on the closest triangle found. That is easily done with a local variable which keeps min_depth and the matching min_color (see the depth-test sketch at the end of this answer).

This is ideal for GPU because:

  • The job is parallelized over a large number of threads because there are many pixels.
  • Each thread executes exactly the same instructions, with runtime linear in the number of triangles.
  • (3) can be implemented very easily as a point-in-triangle test, see this answer, which can basically be copied into the kernel code (a sketch of such a test is included after the kernel below). It optionally works in 3D by applying the same operations to the .z component as well.
  • Threads in a local group will process neighboring pixels and thus likely the same triangles, hitting the same branches. It might even be possible to implement this in a branchless way.
__kernel void fragment_shader(__global float3* fragments, __global struct Triangle_* triangles, int triCount)
{
    size_t px = get_global_id(0);
    size_t py = get_global_id(1);

    for (int index = 0; index < triCount; index++)
    { 
        float3 v0Raster = (float3)(triangles[index].v[0].pos[0], triangles[index].v[0].pos[1], triangles[index].v[0].pos[2]);
        float3 v1Raster = (float3)(triangles[index].v[1].pos[0], triangles[index].v[1].pos[1], triangles[index].v[1].pos[2]);
        float3 v2Raster = (float3)(triangles[index].v[2].pos[0], triangles[index].v[2].pos[1], triangles[index].v[2].pos[2]);
        float3 p = (float3)((float)px,(float)py, 0.0f);
        if (PointInTriangle(p, v0Raster, v1Raster, v2Raster))
        {
             fragments[px+get_global_size(0)*py]=triangles[index].color;
        }
    }
}

Launch it with the code you already have: global size matching the fragment buffer; the local size is irrelevant here.
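
The answer does not show PointInTriangle itself. A minimal sketch of a 2D edge-function (sign) test, in the spirit of the linked answer, could look like this; the helper name edgeFunction is just for illustration:

// Signed area (times two) of the triangle (a, b, p); its sign tells on which
// side of the edge a->b the point p lies.
float edgeFunction(float3 a, float3 b, float3 p)
{
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

bool PointInTriangle(float3 p, float3 v0, float3 v1, float3 v2)
{
    float d0 = edgeFunction(v0, v1, p);
    float d1 = edgeFunction(v1, v2, p);
    float d2 = edgeFunction(v2, v0, p);

    // The point is inside when all three edge functions share the same sign,
    // so the test works for both triangle windings.
    bool has_neg = (d0 < 0.0f) || (d1 < 0.0f) || (d2 < 0.0f);
    bool has_pos = (d0 > 0.0f) || (d1 > 0.0f) || (d2 > 0.0f);
    return !(has_neg && has_pos);
}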

I would recommend first dropping all the lighting stuff and focusing on a minimal working example, e.g. colors only. Make that fast, and only then start adding more features.

Do note that the code above will not render the triangles correctly depth-wise; it sets the fragment to the color of whichever matching triangle comes last, regardless of its depth.
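
If you do want correct visibility, here is a hedged sketch of the same kernel with a per-pixel depth test folded into the loop, keeping the closest hit in min_depth/min_color as described in point (4). It assumes pos[2] already holds a usable depth and, for brevity, uses the unweighted average of the vertex depths instead of proper barycentric interpolation:

__kernel void fragment_shader(__global float3* fragments, __global struct Triangle_* triangles, int triCount)
{
    size_t px = get_global_id(0);
    size_t py = get_global_id(1);

    float min_depth = FLT_MAX;          // closest depth seen so far
    float3 min_color = (float3)(0.0f);  // background color

    for (int index = 0; index < triCount; index++)
    {
        float3 v0Raster = (float3)(triangles[index].v[0].pos[0], triangles[index].v[0].pos[1], triangles[index].v[0].pos[2]);
        float3 v1Raster = (float3)(triangles[index].v[1].pos[0], triangles[index].v[1].pos[1], triangles[index].v[1].pos[2]);
        float3 v2Raster = (float3)(triangles[index].v[2].pos[0], triangles[index].v[2].pos[1], triangles[index].v[2].pos[2]);
        float3 p = (float3)((float)px, (float)py, 0.0f);

        if (PointInTriangle(p, v0Raster, v1Raster, v2Raster))
        {
            // Crude per-triangle depth: average of the vertex depths.
            // A correct rasterizer would interpolate with barycentric weights.
            float depth = (v0Raster.z + v1Raster.z + v2Raster.z) / 3.0f;
            if (depth < min_depth)
            {
                min_depth = depth;
                min_color = triangles[index].color;
            }
        }
    }

    // Write once per pixel: the closest triangle's color, or the background.
    fragments[px + get_global_size(0) * py] = min_color;
}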