I'm just starting to use Julia's CUDArt package to manage GPU computing.  I am wondering how to ensure that when I pull data from the GPU (e.g. using to_host()) I don't do so before all of the necessary computations have been performed on it.
Through some experimentation, it seems that to_host(CudaArray) will block while the particular CudaArray is still being updated.  So perhaps relying on that is enough to ensure safety?  But it seems a bit chancy.
Right now, I am using the launch() function to run my kernels, as depicted in the package documentation.  
The CUDArt documentation gives an example using Julia's @sync macro, which seems like it could be lovely.  But for the purposes of @sync, my "work" is done and I am ready to move on as soon as the kernel is launched with launch(), not once it finishes.  As far as I understand the operation of launch(), there isn't a way to change this behavior (e.g. to make it wait to receive the output of the function it "launches").
How can I accomplish such synchronization?
                        
I think the more canonical way is to make a stream for each device:
streams = [(device(dev); Stream()) for dev in devlist]

and then, inside the @async block, after you tell it to do the computations, you use the wait(stream) function to tell it to wait for that stream to finish its computations. See the Streams example in the README.
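A minimal sketch of that pattern, assuming you already have a compiled kernel and device arrays in hand (kernel, grid_size, block_size, d_input, and d_output below are placeholders, not names from CUDArt itself):

```julia
using CUDArt

# Pick devices with compute capability >= 2.0 (adjust to taste)
devlist = devices(dev -> capability(dev)[1] >= 2)

# One stream per device, created while that device is active
streams = [(device(dev); Stream()) for dev in devlist]

@sync begin
    for (i, dev) in enumerate(devlist)
        @async begin
            device(dev)
            # launch() queues the kernel on this device's stream and
            # returns immediately
            launch(kernel, grid_size, block_size, (d_input, d_output);
                   stream=streams[i])
            # wait() yields to other Julia tasks until the stream's
            # queued work is done, so the copy below sees the final data
            wait(streams[i])
            host_result = to_host(d_output)
        end
    end
end
```

The key point is that wait(stream) blocks only the current task, so the other @async blocks keep feeding their own devices while each one waits for its stream; to_host() is then guaranteed to run after the kernel has finished.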