Low level ReSTIR optimisations

Introduction

ReSTIR is meant to improve ray tracing performance through better selection of samples and the sharing of those selected samples between pixels. My current implementation suffers massively from how the GPU operates: high branch divergence and random memory access. Let's take a look at the structure of the reuse kernels in an abstract way to begin with, to identify the problems.

Global optimisations

Array of Structs to Struct of Arrays

__global__ void ReSTIR_kernel(ReSTIRContext* rctx, Scene* world, curandState* rand_states) {
    // for each thread
    get idx of current pixel;

    Reservoir reservoir;

    for candidate_idx in reservoir_choices {
        candidate = load reservoir at candidate_idx;

        // skip invalid candidates
        if (candidate.M <= 0.0f || candidate.W <= 0.0f) {
            continue;
        }

        reservoir.update(candidate);
    }
}

Our current Reservoir structure contains a LightSample (56 bytes), 4 floats (16 bytes), and a vec3 (12 bytes), for a total of 84 bytes (padded to 88 bytes so that array elements stay properly aligned across transactions). This means that for candidate reservoirs that are invalid we load 88 bytes from global memory only to read two 4-byte floats. This incurs an 11x read cost for the data we actually use before 'discarding' the rest*. To test our hypothesis, let's profile the code using Nsight Compute.

(*Note: In practice this is often not the case, as the compiler will optimise the load and fetch only these two floats. However, in the AoS design they sit at scattered, unaligned offsets, so each still requires an entire 32-byte sector read, wasting 87.5% of the bandwidth of that transaction. In SoA the reads are contiguous, utilising 100% of the fetched cache line.)
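To make those numbers concrete, here is a host-side sketch of the layout described above. The field order and the 8-byte alignment of float2 (emulated here with alignas) are assumptions chosen to match the sizes quoted; the project's real structs may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Vec3 { float x, y, z; };            // 12 bytes
struct alignas(8) Float2 { float x, y; };  // CUDA's float2 is 8-byte aligned

// 52 bytes of fields, padded to 56 by the 8-byte alignment requirement
struct LightSample {
    uint32_t tri_id;
    uint32_t mat_id;
    Vec3 point;
    Vec3 normal;
    Float2 texture_uv;
    Vec3 Le;
};

// 84 bytes of fields, padded to 88
struct Reservoir {
    LightSample s;
    float w_sum;
    float W;
    float target_pdf;
    Vec3 contribution;
    float M;
};

static_assert(sizeof(LightSample) == 56, "matches the 56 bytes quoted");
static_assert(sizeof(Reservoir) == 88, "84 bytes padded to 88");
```

Note that under this layout W sits at byte offset 60 and M at byte offset 80 of the struct, so even the small validity check touches two different 32-byte sectors per candidate.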

Fig. 1 - Condensed Nsight Compute report for the current code
Function Name    | Duration | Potential Gain | Compute Throughput | Memory Throughput | #Registers
initial_RIS      | 5.9ms    | 3.2ms          | 43%                | 69%               | 114
visibility_pass  | 7.8ms    | 4.8ms          | 66%                | 88%               | 56
reuse_temporally | 2.2ms    | 1.6ms          | 4%                 | 75%               | 72
reuse_spatially  | 3.8ms    | 3.0ms          | 13%                | 95%               | 95
shade_pixels     | 8.4ms    | 5.1ms          | 60%                | 89%               | 63
get_output       | 0.2ms    | 0.1ms          | 10%                | 62%               | 24
set_prev         | 0ms      | 0ms            | 0%                 | 0.4%              | 24

Looking at Fig. 1 we can see that all of the kernels have a higher memory throughput than compute throughput. The extreme examples are our reuse kernels, with very low compute throughput and very high memory throughput. In simpler terms, the compute cores sit idle waiting for data to be fetched before they can do any work on it. There is also a very high register count for the initial RIS pass, meaning we cannot run enough threads concurrently to hide the memory latency.

One way to address this is to change from an Array of Structures (AoS) to a Structure of Arrays (SoA). Meaning, instead of the following code:

struct Reservoir {
    LightSample s;
    float w_sum;
    float W;
    float target_pdf;
    vec3 contribution;
    float M;
};

Reservoir* initial_samples;

We instead do the following:

struct ReservoirBuffer {
    LightSample* s;
    float* w_sum;
    float* W;
    float* target_pdf;
    vec3* contribution;
    float* M;
};

ReservoirBuffer initial_samples;

This allows us to load both M and W from a specific reservoir without incurring the penalty of loading all the other data. Another benefit is that when the threads of a warp all read M and W this happens simultaneously, and since the values are only 4 bytes each and now sit contiguously, the hardware can coalesce the 32 reads into a single 128-byte memory transaction.
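We can sanity-check that claim on the host by counting which 32-byte sectors a warp's 32 validity loads touch under each layout: a stride of 4 bytes for the SoA float array versus the 88-byte stride of the padded AoS Reservoir (addresses here are relative to an aligned base, an assumption for the sketch).

```cpp
#include <cstddef>
#include <set>

constexpr std::size_t kWarpSize = 32;
constexpr std::size_t kSectorBytes = 32;

// Count the distinct 32-byte sectors touched when each of the 32 lanes in a
// warp loads one 4-byte float at byte offset lane * stride.
std::size_t sectors_touched(std::size_t stride) {
    std::set<std::size_t> sectors;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        std::size_t addr = lane * stride;
        sectors.insert(addr / kSectorBytes);
        // a 4-byte load can straddle a sector boundary
        sectors.insert((addr + sizeof(float) - 1) / kSectorBytes);
    }
    return sectors.size();
}
```

With the SoA stride of 4 the warp touches 4 sectors (128 contiguous bytes, one coalesced transaction); with the AoS stride of 88 every lane drags in its own sector, 32 in total, an 8x traffic difference for the same 128 bytes of useful data.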

Our abstract code now looks like:

__global__ void ReSTIR_kernel(ReSTIRContext* rctx, Scene* world, curandState* rand_states) {
    // for each thread

    get idx of current pixel;
	
    Reservoir reservoir;

    for candidate_idx in reservoir_choices {
        candidate_M = load M at candidate_idx;
        candidate_W = load W at candidate_idx;
		
        // skip invalid candidates
        if (candidate_M <= 0.0f || candidate_W <= 0.0f) {
            continue;
        }

        // load rest of the data
        ...
		
        reservoir.update(candidate data);
    }
}

Now let's profile and see the improvement.

Fig. 2 - Condensed Nsight Compute report of the code after implementing SoA
Function Name    | Duration | Potential Gain | Compute Throughput | Memory Throughput | #Registers
initial_RIS      | 5.5ms    | 2.9ms          | 47%                | 60%               | 117
visibility_pass  | 7.5ms    | 4.6ms          | 68%                | 91%               | 40
reuse_temporally | 1.8ms    | 1.3ms          | 6%                 | 70%               | 72
reuse_spatially  | 2.9ms    | 2.2ms          | 17%                | 94%               | 96
shade_pixels     | 8.1ms    | 4.9ms          | 62%                | 91%               | 40
get_output       | 0.2ms    | 0.1ms          | 10%                | 62%               | 24
set_prev         | 0ms      | 0ms            | 0%                 | 0.8%              | 30

This refactor yields a 7% increase in framerate. However, both reuse kernels still show extremely high memory throughput, so let's look at another cause.

Compacting our LightSample

Our light samples have quite a large memory footprint:

struct LightSample {
    uint32_t tri_id;
    uint32_t mat_id;

    // geometry of x2
    // path of (x0, x1, x2)
    vec3 point;
    vec3 normal;

    float2 texture_uv;

    // Light emitted
    vec3 Le;
};

Each time we load a LightSample we must fetch 56 bytes. One way to improve this is to shrink the LightSample and inflate it back to the full struct only when we need to use the data.

With just the triangle id, the material id, and the barycentric coordinates of the sample point, we can reconstruct the rest of the data on demand. This takes our struct from 52 bytes (padded to 56) down to 16 bytes.
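A sketch of what the compacted struct and its inflate step could look like. The Triangle layout and helper names here are placeholders for illustration, not the project's real API.

```cpp
#include <cstdint>

struct Vec3 { float x, y, z; };

// 16 bytes: everything else is recomputed on demand
struct CompactLightSample {
    uint32_t tri_id;
    uint32_t mat_id;
    float bary_u;   // barycentric coordinates of the sample point
    float bary_v;
};

// Placeholder triangle holding the data needed to inflate a sample
struct Triangle { Vec3 v0, v1, v2; };

// Reconstruct the sampled point by barycentric interpolation
Vec3 inflate_point(const Triangle& tri, const CompactLightSample& s) {
    float w = 1.0f - s.bary_u - s.bary_v;
    return { w * tri.v0.x + s.bary_u * tri.v1.x + s.bary_v * tri.v2.x,
             w * tri.v0.y + s.bary_u * tri.v1.y + s.bary_v * tri.v2.y,
             w * tri.v0.z + s.bary_u * tri.v1.z + s.bary_v * tri.v2.z };
}

static_assert(sizeof(CompactLightSample) == 16, "3.5x smaller than the 56-byte LightSample");
```

The normal, texture_uv, and Le fields would be rebuilt the same way from the triangle's per-vertex attributes and its material.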

We will get back to implementing this soon, as doing it now would affect the layout of the entire project.

Individual Kernel Improvements

Let us start at the first kernel we call, the kernel responsible for creating the initial samples via RIS.

Initial RIS

__global__ void ReSTIR_initial_RIS(ReSTIRContext* rctx, Scene* world, curandState* rand_states) {
    // for each thread

    get idx into buffers

    reset ReservoirBuffers
    reset colour buffer

    create and trace primary ray
    if we didn't hit anything return
    if we hit an emissive surface set the colour buffer and return

    store the hit info in the gbuffer

    // begin RIS

    create Reservoir

    for (int c = 0; c < rctx->RIS_candidates; c++) {
        sample from all emissive triangles

        update reservoir if sample is valid at the current pixel
    }
    finalise reservoir
    set reservoir in initial samples
}

Looking at this, we have some major GPU code smells. First, any thread that hits a light or misses the scene returns early and sits idle while the rest of the warp performs the RIS sampling, which could be hundreds of iterations. Secondly, inside each iteration the threads of a block each choose a random sample from all possible samples, so they read from completely different* parts of the triangle array to get their data, leading to cache thrashing.

(*Note: by sampling emissive triangles proportionally to their power instead of uniformly, threads have a higher chance of choosing the same emissive triangle, reducing the thrashing slightly.)
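A minimal host-side sketch of power-proportional selection (the names are illustrative; sample_lights_power in the real code presumably does something equivalent): build a running sum of per-triangle power once, then each draw binary-searches a uniform number into it.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Running (inclusive) sum of per-triangle emitted power, built once on upload.
std::vector<float> build_power_cdf(const std::vector<float>& power) {
    std::vector<float> cdf(power.size());
    float total = 0.0f;
    for (std::size_t i = 0; i < power.size(); ++i) {
        total += power[i];
        cdf[i] = total;
    }
    return cdf;
}

// Pick a triangle index with probability proportional to its power.
std::size_t sample_by_power(const std::vector<float>& cdf, float u01) {
    float target = u01 * cdf.back();
    return std::lower_bound(cdf.begin(), cdf.end(), target) - cdf.begin();
}
```

With powers {1, 1, 8}, 80% of draws land on the bright third triangle, so lanes in a warp mostly hit the same cache lines; the source pdf used for the RIS weight is then power[i] / total rather than 1 / count.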

To address the first issue, we could split this into a wavefront ray tracing pipeline (launch one kernel for intersections, then launch threads only for the pixels that go on to perform RIS). However, this adds extra complexity that would interfere with other optimisations and extensions, so we will only do it as a last resort.

For the second, if each thread chose only from a small fixed set of samples, we could keep all of those candidates in fast on-chip memory, reducing the latency massively. This replaces N x M independent global reads (high latency, high variance) with N global reads + N x M shared memory broadcasts (low latency, negligible variance). We have amortised the cost of sampling from all emissive triangles across the thread block.
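The back-of-envelope behind that claim, for a block of N threads doing M candidates each (numbers only; N and M here are hypothetical, not profiler data):

```cpp
// Per-block memory accesses for the naive scheme vs the shared pool.
constexpr long naive_global_reads(long n, long m)  { return n * m; } // every candidate is a global fetch
constexpr long pooled_global_reads(long n)         { return n; }     // one fetch per thread fills the pool
constexpr long pooled_shared_reads(long n, long m) { return n * m; } // candidates now come from shared memory
```

For a 256-thread block drawing 32 candidates each, 8192 scattered global reads become 256 global reads plus 8192 cheap shared-memory reads.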

This approach works well with ReSTIR. This is because, although each block only samples from a small subset of the possible samples, the spatial reuse stage shares reservoirs across blocks, effectively enlarging the candidate pool.

To implement this optimisation in CUDA we will make use of shared memory (fast on-chip memory accessible by all threads in a single block).

__global__ void ReSTIR_initial_RIS(ReSTIRContext* rctx, Scene* world, curandState* rand_states) {
    create shared pool of possible samples 

    // for each thread

    {
        sample from all emissive triangles
        put sample into the pool
    }

    wait for all threads to finish pooling

    get idx into buffers

    reset ReservoirBuffers
    reset colour buffer

    create and trace primary ray
    if we didn't hit anything return
    if we hit an emissive surface set the colour buffer and return

    store the hit info in the gbuffer

    // begin RIS

    create Reservoir

    for (int c = 0; c < rctx->RIS_candidates; c++) {
        sample from the shared pool

        update reservoir if sample is valid at the current pixel
    }
    finalise reservoir
    set reservoir in initial samples
}


Fig. 3 - Condensed Nsight Compute report of initial RIS after shared memory pool
Function Name    | Duration | Potential Gain | Compute Throughput | Memory Throughput | #Registers
initial_RIS      | 4.6ms    | 2.6ms          | 56%                | 59%               | 96

With this optimisation we drop initial RIS's run time from 5.5ms to 4.6ms, a 16% decrease in runtime, boosting FPS by another 7%. We have also reduced the registers per thread, which allows for a higher potential occupancy.




Full Initial RIS code after all the optimisations


__global__ void ReSTIR_initial_RIS(ReSTIRContext* rctx, Scene* world, curandState* rand_states) {
    __shared__ LightSample s_light_pool[THREADS_PER_BLOCK];
    __shared__ float pdfs[THREADS_PER_BLOCK];
    
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;

    int idx = rctx->get_idx(i, j);
    curandState local_rand_state = rand_states[idx % (rctx->width * rctx->height)];

    {
        float source_pdf;
        
        LightSample sample = world->primitives->sample_lights_power(&local_rand_state, world->materials, source_pdf);
        get_sampled_emission(sample, world->materials, world->textures);

        s_light_pool[tid] = sample;
        pdfs[tid] = source_pdf;
    }

    __syncthreads();

    if (i >= rctx->width || j >= rctx->height) return;


    (rctx->initial_samples).init(idx);
    (rctx->final_samples).init(idx);

    // reset the final colour buffer
    rctx->final_colour[idx] = vec3(0.0f);

    // jittered u, v coords that the ray will pass through
    float u = (float(i) + curand_uniform(&local_rand_state)) / float(rctx->width);
    float v = (float(j) + curand_uniform(&local_rand_state)) / float(rctx->height);

    BVH* primitives = world->primitives;

    // create ray
    ray r = (world->camera)->get_ray(u, v, &local_rand_state, rctx->is_stereoscopic);

    // trace primary ray
    hit_record rec;
    if (!primitives->hit(r, 0.0f, FLT_MAX, rec)) {
        // didn't intersect, so store invalid data
        // (colour has already been set to black)
        rctx->set_gbuffers_invalid(idx);
        rctx->set_null_reservoir(idx);
        rand_states[idx] = local_rand_state;
        return;
    }

    // hit so get material of object
    Material* mat = &world->materials[rec.mat_idx];

    vec3 emission = get_emission(mat, world->textures, rec.texture_uv);
    if (max_component(emission) > 0.0f) {
        // hit an emitter so set data to invalid
        rctx->set_gbuffers_invalid(idx);
        rctx->set_null_reservoir(idx);
        // write the colour to final colour buffer
        rctx->final_colour[idx] = emission;
        rand_states[idx] = local_rand_state;
        return;
    }

    // write into gbuffer
    rctx->g_point[idx] = rec.p;
    rctx->g_norm[idx] = rec.shading_normal;
    rctx->g_mat_ids[idx] = rec.mat_idx;
    rctx->g_uv[idx] = rec.texture_uv;

    // reservoir RIS begins
    Reservoir res;
    res.init();

    for (int c = 0; c < rctx->RIS_candidates; c++) {
        float source_pdf;

        uint32_t light_idx = curand(&local_rand_state) % THREADS_PER_BLOCK;

        LightSample sample = s_light_pool[light_idx];
        source_pdf = pdfs[light_idx];

        // calculate how much light could be received if unoccluded
        vec3 possible_contribution;
        // target pdf is the luminance of the unoccluded light
        float target_pdf;

        if (calculate_target_pdf(rctx, world, rec, *mat, sample, &local_rand_state, possible_contribution, target_pdf)) {
            float w = target_pdf / source_pdf;
            res.update(sample, w, target_pdf, possible_contribution, &local_rand_state);
        }
    }
    res.M = (float)rctx->RIS_candidates;
    res.finalise_W();

    rctx->initial_samples.set_data(idx, res);
    rand_states[idx] = local_rand_state;
}