A CUDA stream is a queue of tasks: memory copies, event recording, event waits, kernel launches, callbacks, and so on.
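For concreteness, here is a hedged sketch showing the standard CUDA runtime call for each of those task types, all enqueued on one stream (the kernel and callback are placeholders):

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel() { }             // placeholder kernel

void CUDART_CB my_callback(void *data) { }  // placeholder host function/callback

int main()
{
    cudaStream_t stream;
    cudaEvent_t  event;
    float *dst, *src;
    cudaStreamCreate(&stream);
    cudaEventCreate(&event);
    cudaMalloc(&dst, 1024);
    cudaMallocHost(&src, 1024);   // pinned host memory, so the copy is truly async

    cudaMemcpyAsync(dst, src, 1024, cudaMemcpyHostToDevice, stream); // memory copy
    cudaEventRecord(event, stream);                                  // event recording
    cudaStreamWaitEvent(stream, event, 0);                           // event wait
    my_kernel<<<1, 1, 0, stream>>>();                                // kernel launch
    cudaLaunchHostFunc(stream, my_callback, nullptr);                // callback

    cudaStreamSynchronize(stream);
    return 0;
}
```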
But these queues don't have infinite capacity. In fact, I find empirically that the limit is not very high: in the thousands, not millions.
My questions:
- Is a CUDA stream's capacity fixed regardless of the kind of enqueued item, or does it depend on what kinds of actions/tasks you enqueue?
- How can I determine this capacity, other than by enqueuing more and more items until no more fit?
The "capacity" behaves differently depending on the kinds of actions/tasks you enqueue.
Here is a demonstration:
If we enqueue a single host function/callback in the midst of a number of kernel calls, on a Tesla V100 with CUDA 11.4 I observe a "capacity" of ~1000 enqueued items. However, if I alternate kernel calls and host functions, I observe a capacity of only ~100 enqueued items.
(The code hangs when the queue becomes "full"; at that point I terminate it with Ctrl-C.)
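A hedged sketch of the alternating-enqueue variant of that experiment (the spin duration and loop count are illustrative, not the exact values used): the kernel spins long enough that launches far outpace execution, so the enqueue calls eventually block once the stream's internal queue fills, and the last number printed approximates the capacity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void delay_kernel()
{
    // Spin so that enqueued work drains far more slowly than it is submitted.
    long long start = clock64();
    while (clock64() - start < 100000000LL) { }
}

void CUDART_CB host_cb(void *data) { }  // trivial host function/callback

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int i = 0; i < 5000; ++i) {
        // The last value printed before the program stalls approximates
        // the queue "capacity" for this mix of task types.
        printf("enqueueing item %d\n", i);
        fflush(stdout);
        delay_kernel<<<1, 1, 0, stream>>>();      // kernel launch
        cudaLaunchHostFunc(stream, host_cb, nullptr); // alternate with a host function
    }
    cudaStreamSynchronize(stream);
    return 0;
}
```

Dropping the `cudaLaunchHostFunc` line (or issuing it only once mid-loop) reproduces the kernels-only variant, which on the setup above stalled around ~1000 items instead of ~100.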
Currently, CUDA does not specify this limit, nor does it provide any explicit way to query it at runtime.