Distributed Data Parallel (DDP) Batch size


Suppose I use 2 GPUs in a DDP setting.

If I would use a batch size of 16 when running the experiment on a single GPU, should I pass 8 or 16 as the batch size when using 2 GPUs with DDP?

Is 16 divided into 8 and 8 automatically?

Thank you!


There are 3 answers

Deusy94 (BEST ANSWER)

As explained here:

  • It parallelizes the application of the given module by splitting the input across the specified devices.
  • The batch size should be larger than the number of GPUs used locally.
  • Each replica handles a portion of the input.

If you use 16 as the batch size, it will be divided automatically between the two GPUs.

Gabriel

No, it won't be split automatically. When you set batch_size=8 under DDP, each GPU receives batches of size 8, so the global batch size is 16.
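A minimal sketch of this (not the asker's code; it assumes the process group is already initialized, e.g. by launching with torchrun, and uses a toy dataset):

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset of 160 samples, just so there is something to shard.
dataset = TensorDataset(torch.randn(160, 3), torch.zeros(160))

# The sampler shards the dataset across processes; batch_size is per process/GPU.
per_gpu_batch_size = 8
loader = DataLoader(dataset, batch_size=per_gpu_batch_size,
                    sampler=DistributedSampler(dataset))

# With 2 processes (one per GPU), one step over `loader` in every process
# covers 2 * 8 = 16 samples globally.
global_batch_size = dist.get_world_size() * per_gpu_batch_size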

M Ciel

I don't agree with Deusy94's answer.

If I understand correctly, according to PyTorch's official example using DistributedDataParallel (DDP), at line 160:

args.batch_size = int(args.batch_size / ngpus_per_node)

the batch size you pass when instantiating the DataLoader is the batch size for a single process/single GPU.

Note that the help text in the argparser reads:

parser.add_argument('-b', '--batch-size', default=256, type=int,
                metavar='N',
                help='mini-batch size (default: 256), this is the total '
                     'batch size of all GPUs on the current node when '
                     'using Data Parallel or Distributed Data Parallel')

Hence, let's say you passed --batch-size 16 and you have two GPUs: args.batch_size will be updated to 8 (divided by the number of GPUs) at line 160 above, and the DataLoader you actually create has a batch_size of 8, which is the dataloader for an individual GPU.
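As a sketch of the same pattern (the variable names here are illustrative, not the example's actual code), the conversion is just:

total_batch_size = 16                                         # what you pass as --batch-size
ngpus_per_node = 2
per_gpu_batch_size = int(total_batch_size / ngpus_per_node)   # 8, as in line 160

# DataLoader(dataset, batch_size=per_gpu_batch_size) then yields batches of 8
# per GPU, so one optimizer step still covers 16 samples across the node.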

Therefore, if you create the dataloader with DataLoader(dataset, batch_size=16) and you start DDP with 2 GPUs, each GPU will proceed with batch_size=16 and your global batch size will be 32.

This is different from DataParallel, which has a scatter/gather procedure, so your batch is automatically scattered into equal-sized chunks, one for each GPU (i.e., with DataLoader(dataset, batch_size=16), each GPU gets 8).
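For comparison, a rough DataParallel sketch (assuming a single machine with 2 visible GPUs; the model and data are toys):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(160, 8))
loader = DataLoader(dataset, batch_size=16)       # 16 is the global batch size here

model = nn.DataParallel(nn.Linear(8, 4).cuda())   # replicas on all visible GPUs
for (x,) in loader:
    out = model(x.cuda())   # DataParallel scatters the 16 samples, 8 per GPU
    break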

Either way, it's easy to verify: iterate over the dataloader with a progress bar (e.g., tqdm) to log how many steps it takes to traverse all batches (i.e., the number of batches), and then check which equation holds: batch_size * num_batches == dataset_size or num_gpus * batch_size * num_batches == dataset_size.
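For example, a small helper for that check (pure Python, the numbers are made up; watch out for drop_last or an uneven last batch):

def which_convention(num_batches, batch_size, dataset_size, num_gpus):
    # num_batches is how many steps one process needs to traverse its dataloader.
    if batch_size * num_batches == dataset_size:
        return "batch_size is the global batch size (DataParallel-style)"
    if num_gpus * batch_size * num_batches == dataset_size:
        return "batch_size is per GPU (DDP-style)"
    return "neither, check drop_last / uneven splits"

# e.g. 2 GPUs, batch_size=16, 1600 samples, and each process takes 50 steps:
print(which_convention(50, 16, 1600, 2))   # -> per-GPU (DDP-style)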