I am interested in the differences between the constant cache and the texture cache for devices of compute capability 3.5, particularly the broadcasting behaviour. When all threads in a warp issue a request for the same data element from constant memory and it hits in the cache, it is broadcast to all threads in a single cycle. What is the behaviour of the texture cache in this case? Do the loads get serialised?
Also, am I correct to think that both the constant and texture cache are per multiprocessor and hence shared by multiple blocks?
NVIDIA does not provide additional details on the size or location of the constant cache.
The number of texture caches varies.
Warps in blocks will be allocated across the warp schedulers in an SM.
If all 32 threads in a warp perform an indexed constant read to the same address, it will be performed in 1 instruction issue, provided the request hits in the cache.
If all 32 threads in a warp perform an LDG to the same address in the CC 3.5 texture cache, the data will be requested and returned over 8 cycles.
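To make the two access patterns concrete, here is a minimal sketch of a uniform indexed constant read next to a uniform `__ldg` read. The names `c_table`, `uniformConstantRead`, `uniformLdgRead` and `runUniformReadDemo`, as well as the table size and index, are my own assumptions for illustration. It only checks that both paths return the same value; it does not measure the issue or cycle counts described above, and it needs to be compiled for compute capability 3.5 or later because of `__ldg`.

```cuda
#include <cuda_runtime.h>

// Hypothetical lookup table in constant memory (name and size are assumptions).
__constant__ float c_table[256];

// Every thread reads the same constant address: the uniform (broadcast) case.
__global__ void uniformConstantRead(float *out, int idx)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = c_table[idx];
}

// Every thread reads the same global address through the read-only
// (texture) path by using the __ldg intrinsic.
__global__ void uniformLdgRead(const float *__restrict__ table, float *out, int idx)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = __ldg(&table[idx]);
}

// Runs both kernels and returns 0 if every thread saw the expected value.
int runUniformReadDemo()
{
    const int n = 256;        // total threads, also the table size
    const int threads = 128;  // threads per block
    const int idx = 7;        // the single element all threads read
    const size_t bytes = n * sizeof(float);

    float h_table[n], h_const[n], h_ldg[n];
    for (int i = 0; i < n; ++i) h_table[i] = 2.0f * i;

    float *d_table = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_table, bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_table); return 1; }

    // Same data goes to global memory and to the constant symbol.
    cudaMemcpy(d_table, h_table, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_table, h_table, bytes);

    uniformConstantRead<<<n / threads, threads>>>(d_out, idx);
    cudaError_t errConst = cudaMemcpy(h_const, d_out, bytes, cudaMemcpyDeviceToHost);

    uniformLdgRead<<<n / threads, threads>>>(d_table, d_out, idx);
    cudaError_t errLdg = cudaMemcpy(h_ldg, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_table);
    cudaFree(d_out);
    if (errConst != cudaSuccess || errLdg != cudaSuccess) return 1;

    // Both paths must deliver table[idx] to every thread.
    for (int i = 0; i < n; ++i)
        if (h_const[i] != h_table[idx] || h_ldg[i] != h_table[idx]) return 2;
    return 0;
}
```

Calling `runUniformReadDemo()` from `main` is enough to exercise both kernels. To see the difference in how the requests are issued you would have to time the kernels or look at them in a profiler, since the results themselves are identical.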