I need to implement a warp shuffle that looks like this:

In this picture, the number of threads is limited to 8 to keep it readable.
According to the NVIDIA SDK and the PTX manual, the shuffle instruction should do the job, specifically the PTX instruction shfl.idx.b32 d[|p], a, b, c;
From the manual I read:
Each thread in the currently executing warp will compute a source lane
index j based on input operands b and c and the mode. If the computed
source lane index j is in range, the thread will copy the input operand
a from lane j into its own destination register d;
So, by providing proper values of b and c, I should be able to do it by writing a function like this (inspired by the CUDA SDK __shfl primitive implementation):
__forceinline__ __device__ float shuffle(float var){
float ret;
int srcLane = ???
int c = ???
asm volatile ("shfl.idx.b32 %0, %1, %2, %3;" : "=f"(ret) : "f"(var), "r"(srcLane), "r"(c));
return ret;
}
If this is possible, what values should srcLane and c take? I am not able to determine them (I am using CUDA 8.0).
Best,
Timocafe
I would recommend doing this with the CUDA intrinsic rather than with PTX (or inline ASM). However, the following code demonstrates both methods:
Using the CUDA intrinsic (the first method), the only real task is to compute the source lane index. Based on your pattern, I wrote some code to do that and put it in the variable j.
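The picture and the answer's own listing are not reproduced here, so what follows is only a minimal sketch of the two methods, not the original code. The mapping in source_lane() is a hypothetical stand-in (each lane reads from its right-hand neighbour, with wrap-around) and would need to be replaced by the pattern from the picture. The names source_lane, shuffle_intrinsic, shuffle_ptx, shuffle_kernel and count_mismatches are made up for illustration. For a full-warp shuffle, the value c = 0x1f used below follows the encoding described in the PTX manual (clamp value in bits 4:0, segment mask in bits 12:8). The sketch assumes that CUDA 9.0 and later require the _sync variants, while CUDA 8.0 uses the forms from the question.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32
#define FULL_MASK 0xffffffffu

// HYPOTHETICAL pattern, standing in for the one in the picture:
// lane i reads from lane (i + 1) mod 32. Replace with the real mapping.
__host__ __device__ inline int source_lane(int lane) {
    return (lane + 1) % WARP_SIZE;
}

// Method 1: CUDA intrinsic. j is the source lane index.
__device__ __forceinline__ float shuffle_intrinsic(float var, int j) {
#if CUDART_VERSION >= 9000
    return __shfl_sync(FULL_MASK, var, j, WARP_SIZE);
#else
    return __shfl(var, j, WARP_SIZE);  // CUDA 8.0 form
#endif
}

// Method 2: inline PTX. Operand b is the source lane j; operand c = 0x1f
// selects a single 32-lane segment with a clamp value of 31.
__device__ __forceinline__ float shuffle_ptx(float var, int j) {
    float ret;
    int c = 0x1f;
#if CUDART_VERSION >= 9000
    asm volatile("shfl.sync.idx.b32 %0, %1, %2, %3, %4;"
                 : "=f"(ret) : "f"(var), "r"(j), "r"(c), "r"(FULL_MASK));
#else
    asm volatile("shfl.idx.b32 %0, %1, %2, %3;"
                 : "=f"(ret) : "f"(var), "r"(j), "r"(c));
#endif
    return ret;
}

// Each lane contributes its own lane index, so after the shuffle every
// lane should hold source_lane(lane).
__global__ void shuffle_kernel(float *out_intrinsic, float *out_ptx) {
    int lane = threadIdx.x % WARP_SIZE;
    float var = (float)lane;
    int j = source_lane(lane);
    out_intrinsic[threadIdx.x] = shuffle_intrinsic(var, j);
    out_ptx[threadIdx.x] = shuffle_ptx(var, j);
}

// Host-side check: runs one warp and counts lanes where either method
// disagrees with the expected source lane.
int count_mismatches() {
    float h_a[WARP_SIZE], h_b[WARP_SIZE];
    for (int i = 0; i < WARP_SIZE; ++i) { h_a[i] = -1.0f; h_b[i] = -1.0f; }
    float *d_a = 0, *d_b = 0;
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMalloc(&d_b, sizeof(h_b));
    shuffle_kernel<<<1, WARP_SIZE>>>(d_a, d_b);
    cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, sizeof(h_b), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    int bad = 0;
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        float expected = (float)source_lane(lane);
        if (h_a[lane] != expected || h_b[lane] != expected) ++bad;
    }
    return bad;
}
```

Calling count_mismatches() from main and checking that it returns 0 is one way to confirm that both methods agree. With the intrinsic, only the computation of j changes when the pattern changes; the PTX version additionally needs c to be chosen by hand, which is one reason to prefer the intrinsic.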