Suppose a CUDA GPU with just one SM.
Can I start two CUDA kernels in parallel that each use, say, 512 threads? Or are resources assigned to kernels at block granularity, causing the kernels to be executed in series rather than in parallel and idling the remaining 512 cores during both executions?
I tried to find documentation describing the scheduling mechanisms behind NVIDIA GPUs, but I could only find documentation describing how a single kernel gets resources allocated to it. It did not state whether two kernels can be assigned to a single SM in parallel, or whether resources are allocated at the warp level, or maybe even at the thread level.
I expect either block-level or warp-level resource sharing, but I'd like to understand the actual scheduling algorithms rather than guess or reverse-engineer.
I've lightly edited the question so that it is more-or-less sensible and does not conflate the software hierarchy (blocks) with the hardware hierarchy (SMs, CUDA cores, etc.). In my view, the comment stream confirms the OP's acknowledgement of this.
Yes, even on a single SM, it is possible for the block scheduler to deposit blocks from two different kernels.
Here is an example/test case:
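A minimal sketch consistent with the description below (the kernel names `k1`/`k2` and the flag `s` come from the text; the 512-thread launch configuration and stream setup are illustrative assumptions):

```cuda
#include <cstdio>

__device__ volatile int s = 0;  // global flag, initially zero

__global__ void k1() {
  while (s == 0) {}             // spin until the flag becomes non-zero
}

__global__ void k2() {
  s = 1;                        // set the flag, allowing k1 to exit
}

int main() {
  cudaStream_t stream1, stream2;
  cudaStreamCreate(&stream1);   // separate non-default streams are needed
  cudaStreamCreate(&stream2);   // for the two kernels to run concurrently
  k1<<<1, 512, 0, stream1>>>();
  k2<<<1, 512, 0, stream2>>>();
  cudaDeviceSynchronize();      // hangs here if the kernels cannot coexist
  printf("success\n");
  return 0;
}
```

Launching the two kernels into separate non-default streams is what makes concurrent execution possible at all.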
This is on a laptop with a Quadro K610M.
`deviceQuery` shows us that this GPU has a single SM. The test code shows two kernels. The first launched kernel (`k1`) will spin forever until the global location (`s`) becomes non-zero. The second launched kernel (`k2`) sets the global location to non-zero, allowing the first kernel to exit. The net result is normal application termination. If the two kernels were not scheduled at the same time on the same SM (there is only 1), then the first kernel would spin forever, the second would never run, and the observed application behavior would be a hang. (You can witness a hang, for example, by launching both kernels into the same stream, e.g. the default NULL stream.) The fact that we do not see a hang means that both kernels ran at the same time, on the same SM.
There are a number of other questions here on the `cuda` SO tag which discuss block scheduling and GPU scheduling behavior. Here is one example. There are others.

EDIT: Responding to a question in the comments, here is a test case that tries to find how many kernels can run "at once" on this single SM GPU:
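Again a sketch rather than the original listing; the compile-time kernel count `NL` comes from the text, while the counter-based handshake (each kernel increments a shared counter, then spins until all `NL` kernels have checked in) is an assumed mechanism:

```cuda
#include <cstdio>

#ifndef NL
#define NL 16                   // number of kernels to launch concurrently
#endif

__global__ void k(volatile int *c) {
  atomicAdd((int *)c, 1);       // announce this kernel's arrival
  while (*c < NL) {}            // wait until all NL kernels have arrived
}

int main() {
  int *c;
  cudaMalloc(&c, sizeof(int));
  cudaMemset(c, 0, sizeof(int));
  cudaStream_t streams[NL];
  for (int i = 0; i < NL; i++) {
    cudaStreamCreate(streams + i);
    k<<<1, 1, 0, streams[i]>>>(c);  // one single-block kernel per stream
  }
  cudaDeviceSynchronize();      // hangs if fewer than NL kernels are co-resident
  printf("success with NL=%d\n", NL);
  return 0;
}
```

Compiled with, e.g., `nvcc -arch=sm_35 -DNL=16 t.cu`, the program can only terminate if all `NL` kernels are resident on the SM at the same time, since every kernel waits for the counter to reach `NL`.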
The above test case is successful for 16 kernels. If I compile instead with `-DNL=32`, the test case hangs. Referring to table 15 in the CUDA 11.4 programming guide, there are several relevant numbers for this cc3.5 GPU:

- Maximum number of resident grids per device (concurrent kernel execution): 32
- Maximum number of resident blocks per multiprocessor: 16

So we seem to have confirmed that if we go beyond 16 kernels, we have exceeded the 16-blocks-per-SM limit, and the code will hang.