I'm working on a library function that uses concurrent.futures to spread network I/O across multiple threads. Due to the Python GIL I'm experiencing a slowdown on some workloads (large files), so I want to switch to multiple processes. However, multiple processes will also be less than ideal for some other workloads (many small files). I'd like to split the difference and have multiple processes, each with their own thread pool.
The problem is job queueing - concurrent.futures doesn't seem to be set up to queue jobs properly for multiple processes that each can handle multiple jobs at once. While breaking up the job list into chunks ahead of time is an option, it would work much more smoothly if jobs flowed to each process asynchronously as their individual threads completed a task.
How can I efficiently queue jobs across multiple processes and threads using this or a similar API? Aside from writing my own executor, is there any obvious solution I'm overlooking? Or is there any prior art for a mixed process/thread executor?
If I understand what you are trying to do, you basically have a lot of jobs that are suitable for multithreading except that there is some CPU-intensive work. So your idea is to create multiple thread pools in multiple child processes so that there is less GIL contention. Of course, in any given child process the CPU-intensive code will only be executed serially (assuming it is Python bytecode), so it's not a perfect solution.
One approach is to just create a very large multiprocessing pool (larger than the number of cores you have). There is a limit to how many processes you can create, and process creation is expensive. But since most of the time they will be waiting for I/O to complete, the I/O portion will execute concurrently.
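A rough sketch of that approach, oversubscribing the process pool relative to the core count (`fetch_and_process` is a placeholder for the real I/O + CPU job):

```python
import concurrent.futures
import os

def fetch_and_process(url):
    # Placeholder for the real work: network I/O plus some CPU-bound
    # processing of the downloaded data.
    return len(url)

if __name__ == "__main__":
    # Oversubscribe: far more processes than cores, on the theory that
    # most workers are blocked on I/O at any given moment.
    n_workers = 4 * (os.cpu_count() or 1)
    with concurrent.futures.ProcessPoolExecutor(max_workers=n_workers) as executor:
        results = list(executor.map(fetch_and_process, ["a", "bb", "ccc"]))
    print(results)
```

Note that `map` returns results in input order, so the output is deterministic even though the jobs run concurrently.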
A better approach would be to create a multiprocessing pool whose executor can be passed to a multithreading pool worker function along with the other required arguments. This is an inversion of what you were planning to do. When the worker function has CPU-intensive work to perform, it can submit that work to the passed multiprocessing pool executor and block for the returned result. In that way you get the optimal parallelism you can achieve given the number of cores you have. This would be my recommendation. For example:
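A minimal sketch of that inversion, assuming the CPU-bound step can be factored into a helper (`cpu_intensive` and `worker` are placeholder names):

```python
import concurrent.futures

def cpu_intensive(n):
    # CPU-bound portion of a job (placeholder); runs in a process pool
    # worker, so it does not contend for this process's GIL.
    return sum(i * i for i in range(n))

def worker(process_executor, n):
    # Thread pool worker: do the network I/O here (omitted), then hand
    # the CPU-bound part to the shared process pool and block on it.
    return process_executor.submit(cpu_intensive, n).result()

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as processes, \
         concurrent.futures.ThreadPoolExecutor(max_workers=8) as threads:
        futures = [threads.submit(worker, processes, n) for n in (10, 100, 1000)]
        # Results are collected in submission order.
        print([f.result() for f in futures])
```

Since `ProcessPoolExecutor.submit` is thread-safe, many threads can feed the one process pool concurrently, and the number of truly parallel CPU-bound tasks is capped at the core count.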
But if you want to go along with your original idea, or if for some reason the above framework does not fit your actual situation, perhaps something like the following might work:
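A minimal sketch of the original layout under the stated assumptions: a shared job queue feeds several child processes, each running its own thread pool, so jobs flow to whichever thread frees up first (`handle_job` and `process_worker` are placeholder names):

```python
import concurrent.futures
import multiprocessing

def handle_job(job):
    # Per-job work: mostly I/O with some CPU (placeholder).
    return job * job

def process_worker(job_queue, result_queue, threads_per_process):
    # Each child process runs its own thread pool; every thread pulls
    # jobs from the shared queue until it sees a None sentinel.
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads_per_process) as pool:
        def pull_and_run():
            while True:
                job = job_queue.get()
                if job is None:  # sentinel: shut down this thread
                    return
                result_queue.put((job, handle_job(job)))
        futures = [pool.submit(pull_and_run) for _ in range(threads_per_process)]
        concurrent.futures.wait(futures)

if __name__ == "__main__":
    n_procs, n_threads, n_jobs = 2, 4, 10
    job_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=process_worker,
                                     args=(job_queue, result_queue, n_threads))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    for job in range(n_jobs):
        job_queue.put(job)
    for _ in range(n_procs * n_threads):
        job_queue.put(None)  # one sentinel per worker thread
    results = sorted(result_queue.get() for _ in range(n_jobs))
    for p in procs:
        p.join()
    print(results)
```

Because every thread in every process pulls from the one `multiprocessing.Queue`, jobs are handed out asynchronously as threads finish, which avoids pre-chunking the job list.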