I'm writing code for a CNN in TensorFlow that I'm trying to parallelize across multiple nodes and multiple GPUs using Horovod.
To generate the image batches, I have:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# preprocess_input is the one matching the backbone I use, e.g. tf.keras.applications.resnet50.preprocess_input

train_generator = ImageDataGenerator(rescale=1./255,
                                     preprocessing_function=preprocess_input)
train_iterator = train_generator.flow_from_directory(train_loc,
                                                     seed=SEED,
                                                     target_size=(224, 224))
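For context, the rest of my setup follows the standard Horovod Keras pattern from the Horovod examples (a rough sketch; the learning rate value is just a placeholder, and the model itself is the CNN I build elsewhere):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each process to a single local GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Scale the learning rate by the number of workers and wrap the optimizer
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)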
In Python, because of the GIL, to parallelize code each node/process must execute a copy of the exact same script, right? So doesn't that mean each node will produce the same dataset iterator? And in that case, won't each GPU be receiving the exact same batch of images?
I'm sure I'm wrong in saying that all the GPUs are training on the same batches, but I just don't see how Horovod handles distributing the batches.
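The only thing I can think of is sharding manually by rank, something like the sketch below (based on hvd.rank()/hvd.size(); this is just my guess, not something I've seen documented as required), but I don't know whether that's what I'm supposed to do or whether fit() already does something equivalent:

# Option A: give every worker a different shuffling seed, so the batches differ per rank
train_iterator = train_generator.flow_from_directory(train_loc,
                                                     seed=SEED + hvd.rank(),
                                                     target_size=(224, 224))

# Option B: keep the shared iterator but only run 1/N of the steps on each of the N workers
steps_per_epoch = len(train_iterator) // hvd.size()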
Just like there's hvd.DistributedOptimizer to parallelize backprop, what is in charge of distributing the images when we call the following?
model.fit(train_iterator,
          epochs=epochs,
          callbacks=callbacks)
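(In case it matters, callbacks is just the usual list from the Horovod Keras examples, roughly:

callbacks = [
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),  # start every worker from the same initial weights
    hvd.callbacks.MetricAverageCallback(),              # average metrics across workers at epoch end
]
)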
Thanks.
Liam