I want to make thrust::scatter asynchronous by calling it in a device kernel(I could also do it by calling it in another host thread). thrust::cuda::par.on(stream) is host function that cannot be called from a device kernel. The following code was tried with CUDA 10.1 on Turing architecture.
__global__ void async_scatter_kernel(float* first,
float* last,
int* map,
float* output)
{
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
thrust::scatter(thrust::cuda::par.on(stream), first, last, map, output);
cudaDeviceSynchronize();
cudaStreamDestroy(stream);
}
I know thrust uses dynamic parallelism to launch its kernels when called from the device, however I couldn't find a way to specify the stream.
The following code compiles cleanly for me on CUDA 10.1.243:
The
-arch=sm_35(or similar) and-rdc=trueare necessary (but not in all cases sufficient) compile switches for any code that uses CUDA Dynamic Parallelism. If you omit, for example, the-rdc=trueswitch, you get an error similar to what you describe:So, for the example you have shown here, your compilation error can be eliminated either by updating to the latest CUDA version or by specifying the proper command line, or both.