Warp shuffling for CUDA

Question

Asked by Timocafé on 09 March 2018 at 16:19

I need to implement a warp shuffle that looks like this:

[Image: the desired shuffle pattern, drawn for 8 threads]

In this picture, the number of threads is limited to 8 to keep it readable. From my reading of the Nvidia SDK and the PTX manual, the shuffle instruction should do the job, especially the PTX instruction shfl.idx.b32 d[|p], a, b, c;.

From the manual I read:

Each thread in the currently executing warp will compute a source lane
index j based on input operands b and c and the mode. If the computed
source lane index j is in range, the thread will copy the input operand
a from lane j into its own destination register d;

So, given proper values of b and c, I should be able to do it by writing a function like this (inspired by the CUDA SDK implementation of the __shfl primitive).

__forceinline__ __device__ float shuffle(float var){
    float ret;
    int srcLane = ???
    int c = ???
    asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=f"(ret) : "f"(var), "r"(srcLane), "r"(c));
    return ret;
}

If this is possible, what values should srcLane and c take? I have not been able to determine them (I am using CUDA 8.0).

Best,

Timocafe

There are 3 answers

Alex Sherman On 11 March 2018 at 17:08

What you are trying to do in your shuffle operation is to dynamically index the source lanes on which the shuffle operates. One needs to understand that any variation of the shuffle command (__shfl, __shfl_up, __shfl_down, __shfl_xor) needs a constant value for its second parameter, and this parameter is the same for all lanes within a warp. You can control the grouping of threads within a warp by specifying the width. For example, by specifying

float var = ...
__shfl_xor(var, 3, 4);

the lane permutation will look like:

0 1 2 3
   |
3 2 1 0
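
The same XOR pattern can be written with the `_sync` form of the intrinsic (CUDA 9 and later). The following is only a minimal sketch and not part of the original answer: the names `xorSourceLane` and `reverseGroupsOf4` are hypothetical, and the kernel assumes it is launched with exactly one full warp of 32 threads.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: the lane that `lane` reads from in an XOR shuffle.
__host__ __device__ constexpr int xorSourceLane(int lane, int laneMask) {
    return lane ^ laneMask;
}

// Reverses every group of 4 lanes: lane i receives the value of lane i ^ 3.
// Assumption: the block is exactly one full warp (32 threads).
__global__ void reverseGroupsOf4(const float *in, float *out) {
    int lane = threadIdx.x;
    float var = in[lane];
    // Width 4 splits the warp into 8 independent groups of 4 lanes.
    out[lane] = __shfl_xor_sync(0xFFFFFFFFu, var, 3, 4);
}
```

Launched as `reverseGroupsOf4<<<1, 32>>>(d_in, d_out)` with device buffers of 32 floats, lanes 0..3 receive the values of lanes 3, 2, 1, 0, lanes 4..7 those of lanes 7, 6, 5, 4, and so on for each group.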

So, to answer your question, it's not possible to do it with a single __shuffle operation of any kind. But you can implement it by combining several __shuffle commands with different second parameters.

Vitality On 15 March 2018 at 18:03

Robert has already answered this question satisfactorily. I implemented the code below, which shows a permutation of a full warp.

#include <stdio.h>
#include <stdlib.h>   // malloc, free, exit

/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) { getchar(); exit(code); }
    }
}

__global__ void shufflingKernel(double *d_data, double *d_result, int *d_perm){

    unsigned mask = __activemask(); 
    int tid = threadIdx.x;
    int srcLane = d_perm[tid];
    double var = d_data[tid];
    //d_result[tid] = __shfl_sync(0xFFFFFFFF, var, srcLane);
    d_result[tid] = __shfl_sync(mask, var, srcLane);
}

int main(){

    const int N = 32;

    double h_data[32] = { 3.4, 42.2, 2., -1., 10., 11., 2., -1., 10., 33., 2.3, 11., 44., 0., -33., -21.,
        4.4, 43.2, 3., -2., 13., 15., 222., -90., 17., 30., 11.3, 7., 22., 100., -30., -91. };
    double *h_result = (double *)malloc(N * sizeof(double));
    int h_perm[32] = { 6, 11, 9, 2, 5, 23, 31, 0, 3, 27, 29, 1, 28, 30, 17, 13, 10, 8, 4, 22, 7, 18, 24, 12, 20,
        19, 16, 26, 21, 15, 25, 14 };

    int *d_perm; gpuErrchk(cudaMalloc(&d_perm, N * sizeof(int)));
    double *d_data; gpuErrchk(cudaMalloc(&d_data, N * sizeof(double)));
    double *d_result; gpuErrchk(cudaMalloc(&d_result, N * sizeof(double)));
    gpuErrchk(cudaMemcpy(d_perm, &h_perm[0], N * sizeof(int), cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_data, &h_data[0], N * sizeof(double), cudaMemcpyHostToDevice));

    shufflingKernel<<<1, 32>>>(d_data, d_result, d_perm);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(h_result, d_result, N * sizeof(double), cudaMemcpyDeviceToHost));

    for (int k = 0; k < N; k++) {
        printf("k = %d; Original = %f; New = %f; Check = %f\n", k, h_data[k], h_result[k], h_data[h_perm[k]]);
    }

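    // Release host and device memory.
    free(h_result);
    gpuErrchk(cudaFree(d_perm));
    gpuErrchk(cudaFree(d_data));
    gpuErrchk(cudaFree(d_result));
    return 0;
}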

Note that, instead of passing 0xFFFFFFFF as the mask of participating threads, the kernel uses the warp-level primitive __activemask(), which I consider safer in light of the question "Shuffle instruction in CUDA not working".
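
For reference, the first argument of `__shfl_sync` is a 32-bit mask with one bit per lane that takes part in the call. When the set of participating lanes is known in advance, the mask can also be built explicitly rather than queried. A minimal sketch under that assumption follows; `firstLanesMask` and `shuffleFirstLanes` are hypothetical names, and the permutation entries are assumed to lie in the range 0 to n-1.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: mask with the lowest n bits set, for 0 <= n <= 32.
__host__ __device__ constexpr unsigned firstLanesMask(int n) {
    return (n >= 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
}

// Only the first n lanes of the warp shuffle among themselves.
// Assumptions: one warp per block, and perm[lane] < n for every lane < n.
__global__ void shuffleFirstLanes(const double *in, double *out,
                                  const int *perm, int n) {
    int lane = threadIdx.x;
    if (lane < n) {
        // Every lane named in the mask reaches this call.
        unsigned mask = firstLanesMask(n);
        out[lane] = __shfl_sync(mask, in[lane], perm[lane]);
    }
}
```

With n = 8 the helper yields 0xFF, and with n = 32 it yields the full-warp mask 0xFFFFFFFF.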

Robert Crovella On 10 March 2018 at 03:32 (Accepted Answer)

I would recommend doing this with the CUDA intrinsic rather than with PTX (or inline ASM). However, the following code demonstrates both methods:

// cat t54.cu
#include <stdio.h>

__global__ void k(){
    int i = threadIdx.x;
    int j = i;
    if (i<4) j*=2;
    if ((i>3) && (i<8)) j-=(7-i);
    int k = __shfl_sync(0x0FFU, i+100, j);
    printf("lane: %d, result: %d\n", i, k);
}

__forceinline__ __device__ float shuffle(float var, int lane){
    float ret;
    int srcLane = lane;
    int c = 0x1F;
    asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=f"(ret) : "f"(var), "r"(srcLane), "r"(c));
    return ret;
}

__global__ void k1(){
    int i = threadIdx.x;
    int j = i;
    if (i<4) j*=2;
    if ((i>3) && (i<8)) j-=(7-i);
    float k = shuffle((float)(i+100), j);
    printf("lane: %d, result: %f\n", i, k);
}


int main(){
    k<<<1,8>>>();
    cudaDeviceSynchronize();
    k1<<<1,8>>>();
    cudaDeviceSynchronize();
}

$ nvcc -arch=sm_35 -o t54 t54.cu
$ cuda-memcheck ./t54
========= CUDA-MEMCHECK
lane: 0, result: 100
lane: 1, result: 102
lane: 2, result: 104
lane: 3, result: 106
lane: 4, result: 101
lane: 5, result: 103
lane: 6, result: 105
lane: 7, result: 107
lane: 0, result: 100.000000
lane: 1, result: 102.000000
lane: 2, result: 104.000000
lane: 3, result: 106.000000
lane: 4, result: 101.000000
lane: 5, result: 103.000000
lane: 6, result: 105.000000
lane: 7, result: 107.000000
========= ERROR SUMMARY: 0 errors
$

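Using the CUDA intrinsic (the first method), the only real task is to compute the source lane index. Based on your pattern, I wrote some code to do that and stored the result in the variable j. In the inline PTX version (the second method), srcLane is that same index and c is 0x1F, which corresponds to a shuffle across the full warp width of 32.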
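
The source lane computation can also be factored into a small function, which makes the mapping easy to check on the host. This is a sketch rather than the code from the answer: `sourceLane`, `permute8` and `runPermute8` are hypothetical names, and the closed form is simply the two branches for j rewritten as one expression.

```cuda
#include <cuda_runtime.h>

// Source lane for the 8-lane pattern in the question:
//   lanes 0..3 read from lanes 0, 2, 4, 6   (j = 2*i)
//   lanes 4..7 read from lanes 1, 3, 5, 7   (j = i - (7 - i) = 2*i - 7)
//   any other lane reads from itself
__host__ __device__ constexpr int sourceLane(int i) {
    return (i < 4) ? 2 * i : (i < 8) ? 2 * i - 7 : i;
}

// Assumption: launched with exactly 8 threads, so the mask is 0xFF.
__global__ void permute8(int *out) {
    int i = threadIdx.x;
    out[i] = __shfl_sync(0xFFu, i + 100, sourceLane(i));
}

// Hypothetical host wrapper: runs one block of 8 threads and copies the
// 8 results into h_out. Returns the first CUDA error encountered.
cudaError_t runPermute8(int *h_out) {
    int *d_out = nullptr;
    cudaError_t err = cudaMalloc(&d_out, 8 * sizeof(int));
    if (err != cudaSuccess) return err;
    permute8<<<1, 8>>>(d_out);
    err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, 8 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err;
}
```

Each lane i receives `sourceLane(i) + 100`, which is the same sequence as the printed results above: 100, 102, 104, 106, 101, 103, 105, 107.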
