I am trying to download URLs in parallel with the following:
def parallel_download_files(self, urls, filenames):
    pids = []
    for (url, filename) in zip(urls, filenames):
        pid = os.fork()
        if pid == 0:
            open(filename, 'wb').write(requests.get(url).content)
        else:
            pids.append(pid)
    for pid in pids:
        os.waitpid(pid, os.WNOHANG)
But when I execute this with a list of URLs and filenames, memory usage keeps growing until the machine crashes. From the documentation, I thought the options argument of waitpid would be handled correctly by setting it to os.WNOHANG. This is the first time I am trying to parallelize with forks; I have been doing such tasks with concurrent.futures.ThreadPoolExecutor before.
Using os.fork() is far from ideal here, especially as you're not handling the two processes being created (parent/child): after writing its file, each child falls through and continues the loop, forking children of its own, and os.waitpid with os.WNOHANG returns immediately without actually reaping anything. Multithreading is far superior for this use-case.
For example:
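A minimal sketch of the threaded approach, using the ThreadPoolExecutor you already know. I've written it as a standalone function and used the stdlib urllib.request in place of requests to keep it self-contained; the names download_file and max_workers are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def download_file(url, filename):
    # Each task fetches one URL and writes it to its own file.
    with urllib.request.urlopen(url) as response, open(filename, 'wb') as f:
        f.write(response.read())
    return filename

def parallel_download_files(urls, filenames, max_workers=8):
    # The executor caps concurrency at max_workers threads.
    # Iterating over map() re-raises any exception a task hit,
    # so failures don't pass silently.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(download_file, urls, filenames))
```

Because downloads are I/O-bound, threads give you the concurrency you want without the process-management pitfalls of fork.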
Note:
If any filenames are duplicated in the filenames list, then you'll need a more complex strategy that ensures you never have more than one thread writing to the same file.
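One simple strategy, sketched below, is to deduplicate the (url, filename) pairs up front so each target file is downloaded at most once; the function name is illustrative:

```python
def dedupe_downloads(urls, filenames):
    # Keep only the first (url, filename) pair for each filename,
    # so no two submitted tasks ever target the same file.
    seen = set()
    pairs = []
    for url, filename in zip(urls, filenames):
        if filename not in seen:
            seen.add(filename)
            pairs.append((url, filename))
    return pairs
```

If the duplicated filenames are meant to receive *different* URLs, deduplication is not enough and you'd need per-file locking instead.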