In my CUDA program, every thread increments a global (__device__) integer value and uses it for further calculations - each thread needs its own unique value. I've used atomicAdd with a local value
local_count = atomicAdd(&global_count, 1);
for this task, which worked great until I needed to store this number as a 256-bit integer rather than a simple u32. I changed this global value into an array of eight u32 values, which needs rollover (carry) support. In CPU programming this problem is straightforward, but in CUDA many threads may access this global value at the same time, which renders the classic CPU implementation useless. I've tried atomicInc and atomicCAS, but race conditions between threads remain, because the upper u32 values also have to be loaded and/or incremented. What is the best way to tackle this problem?
One possible approach would be to use a critical section controlled by a semaphore as discussed here.
Here is a simple example based on that:
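(The code that originally followed is not reproduced here, so what follows is a minimal sketch of the idea, assuming a single grid-wide lock in __device__ memory; the names sem, big_count and increment_and_fetch are placeholders of mine, not from the linked discussion. Each thread acquires the lock, copies the current 256-bit value as its unique number, increments the counter with carry propagation, and releases the lock.)

#include <cstdio>

#define NWORDS 8

__device__ unsigned int big_count[NWORDS]; // little-endian: word 0 is least significant
__device__ int sem = 0;                    // 0 = unlocked, 1 = locked

// Take a snapshot of the 256-bit counter (this thread's unique value)
// and then increment the counter by 1, all inside the critical section.
__device__ void increment_and_fetch(unsigned int *my_val){
  bool done = false;
  while (!done){
    if (atomicCAS(&sem, 0, 1) == 0){          // try to acquire the lock
      volatile unsigned int *c = big_count;   // volatile view of the protected data
      for (int i = 0; i < NWORDS; i++)
        my_val[i] = c[i];                     // this thread's unique 256-bit value
      unsigned int carry = 1;                 // add 1 with carry propagation
      for (int i = 0; i < NWORDS && carry; i++){
        unsigned int t = c[i] + carry;
        carry = (t == 0) ? 1 : 0;             // wrapped around -> carry into next word
        c[i] = t;
      }
      __threadfence();                        // make the update visible before unlocking
      atomicExch(&sem, 0);                    // release the lock
      done = true;
    }
  }
}

__global__ void test_kernel(){
  unsigned int my_val[NWORDS];
  increment_and_fetch(my_val);
  // ... my_val now holds a value no other thread received ...
}

int main(){
  test_kernel<<<256, 256>>>();                // 65536 increments
  cudaDeviceSynchronize();
  unsigned int h[NWORDS];
  cudaMemcpyFromSymbol(h, big_count, sizeof(h));
  printf("low word = %u (expected 65536)\n", h[0]);
  return 0;
}

Note that this serializes every thread through one lock, so it will be slow under heavy contention; it is functional rather than fast.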
EDIT: After some additional reflection, it seems like a generalized addition operation (at least) could be constructed for a "long integer" using atomics. Here is an example:
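(Again, the original code is not shown above, so this is a sketch of one way such an atomic "long add" could look, consistent with the result described below: each word is updated with atomicAdd, and the returned old value tells us whether that word wrapped, in which case a carry of 1 is propagated into the next word, which may itself wrap, and so on. The names long_add and big_count are placeholders.)

#include <cstdio>

#define NWORDS 8

__device__ unsigned int big_count[NWORDS];   // little-endian: word 0 is least significant

// Add 'val' to the multi-word integer starting at 'addr', propagating carries.
__device__ void long_add(unsigned int *addr, unsigned int val){
  int i = 0;
  while (val && i < NWORDS){
    unsigned int old = atomicAdd(addr + i, val);
    // a carry out of this word occurred iff old + val wrapped past 2^32 - 1
    val = (old > 0xFFFFFFFFu - val) ? 1u : 0u;
    i++;
  }
}

__global__ void test_kernel(){
  unsigned int my_id = blockIdx.x * blockDim.x + threadIdx.x;
  long_add(big_count, my_id);                // every thread adds its own index
}

int main(){
  test_kernel<<<1024, 1024>>>();             // thread ids 0 .. 1048575
  cudaDeviceSynchronize();
  unsigned int h[NWORDS];
  cudaMemcpyFromSymbol(h, big_count, sizeof(h));
  printf("word 1 = %u, word 0 = %u\n", h[1], h[0]);
  return 0;
}

Unlike the critical-section version, this only returns the old value of each individual word, so it does not by itself hand each thread a consistent 256-bit snapshot.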
To check the results, we note that this operation is basically the sum of the integers from 0 to 1048575. That is 1048575 × 1048576 / 2 = 549,755,289,600. The code produced a result of 127 × 4,294,967,296 + 4,294,443,008 = 549,755,289,600.
Both approaches require care if the values are read outside of the add/increment routine. The critical-section method lends itself more easily to a "safe read" routine, if such were required: use the same semaphore, and just read the values. The atomic method would be more difficult to manage in this respect.
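For example, reusing the semaphore and the names from the critical-section sketch above, a safe read might look something like this:

__device__ void safe_read(unsigned int *out){
  bool done = false;
  while (!done){
    if (atomicCAS(&sem, 0, 1) == 0){          // take the same lock as the incrementer
      volatile unsigned int *c = big_count;
      for (int i = 0; i < NWORDS; i++)
        out[i] = c[i];                        // consistent 256-bit snapshot
      __threadfence();
      atomicExch(&sem, 0);                    // release
      done = true;
    }
  }
}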