PyTorch: why is the validation step considerably faster if I train on the validation data?


I am training an FCN model and have two dataloaders, train_loader and val_loader. As you can see in the code below, I made the model train on the validation data. I did this to debug a problem where switching between the two dataloaders would cause the iteration time to increase tenfold from the first loop. I obviously can't actually train the model on the validation data, but why does it behave like this?

The dataset is loaded in another class as a ConcatDataset that merges several ImageFolders, and is then wrapped in dataloaders with batch_size=32, num_workers=os.cpu_count(), persistent_workers=True, pin_memory=True.
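Roughly, the loader construction looks something like the sketch below (a simplified illustration, not the exact code; the paths and transforms are placeholders):

import os
from torch.utils.data import ConcatDataset, DataLoader, random_split
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Merge several ImageFolder datasets into a single ConcatDataset
folders = ["<folder 1>", "<folder 2>"]  # placeholder paths
merged = ConcatDataset([datasets.ImageFolder(p, transform=transform) for p in folders])

# 80/10/10 split into train/val/test
n = len(merged)
n_train, n_val = int(0.8 * n), int(0.1 * n)
train_set, val_set, test_set = random_split(merged, [n_train, n_val, n - n_train - n_val])

loader_kwargs = dict(
    batch_size=32,
    num_workers=os.cpu_count(),
    persistent_workers=True,
    pin_memory=True,
)
train_loader = DataLoader(train_set, shuffle=True, **loader_kwargs)
val_loader = DataLoader(val_set, shuffle=False, **loader_kwargs)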

This is my code:

import torch
# FCN_resnet50, universal_fake_detect, LEARNING_RATE etc. come from my own modules
# (their imports/definitions are omitted here)

if __name__ == "__main__":
    from multiprocessing import freeze_support
    freeze_support()
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = FCN_resnet50().to(device)

    loss_fn = torch.nn.MSELoss().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)


    datasets = universal_fake_detect.Datasets("<Path to data>", (0.8, 0.1, 0.1))
    train_loader = datasets.training()
    val_loader = datasets.validation()

    # tb_writer = SummaryWriter()

    best_eval_loss = float("inf")

    for epoch_index in range(10):

        model.train(True)
        running_loss = 0.0
        train_loss = 0.0

        for i, (inp, lab) in enumerate(train_loader):
            print("Training iteration:", i)
            lab = lab.view(-1, 1, 1, 1).expand(-1, 1, 224, 224).float()
            inputs, labels = inp.to(device), lab.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            if i % 1000 == 9:
                print("  batch {} loss: {}".format(i + 1, running_loss/(i+1)))
                tb_x = epoch_index * len(train_loader) + i + 1

        train_loss = running_loss / len(train_loader)
        print("Training loss", train_loss)


        # Evaluation
        model.eval()
        running_loss = 0.0

        for i, (inp, lab) in enumerate(val_loader):
            print("val iteration:", i)
            lab = lab.view(-1, 1, 1, 1).expand(-1, 1, 224, 224).float()
            inputs, labels = inp.to(device), lab.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = loss_fn(outputs, labels)
            # NOTE: backward() and step() here mean the model is deliberately
            # being trained on the validation data (see explanation above)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        avg_val_loss = running_loss / len(val_loader)
        print("Validation loss:", avg_val_loss)
        if avg_val_loss < best_eval_loss:
            best_eval_loss = avg_val_loss
            torch.save(model.state_dict(), "../../model/FCN_test_model.pth")


    print("finished training")

I have tried many different settings for the dataloader parameters, but nothing seems to help.

1 Answer

Answer from Klops:

I see two causes for your observation:

(1) If you are using image augmentation, it is usually applied only to the training data, not to the validation data. If that is the case, your data loaders are likely slower during training (because they perform the additional augmentation work) than during validation (where augmentation is skipped). That would make "training" on the validation set faster.
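For illustration only (this is not necessarily your pipeline), a typical difference between training and validation transforms looks like this; the extra random ops on the training side run on the CPU workers for every image:

from torchvision import transforms

# Training pipeline: random augmentations, all executed by the DataLoader workers
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Validation pipeline: deterministic resize/crop only, much cheaper per image
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])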

If possible, increase the number of workers so that your training loader becomes faster. Also check your GPU utilization (with the nvidia-smi command): if utilization is below roughly 90%, the loaders are too slow (or too few) to keep the GPU busy while augmentation slows the workers down.
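One way to verify this (a rough sketch reusing the names from your script, so train_loader, model, optimizer, loss_fn and device are assumed to exist) is to time the waiting-for-data part of each iteration separately from the GPU work:

import time
import torch

data_time, compute_time = 0.0, 0.0
t0 = time.perf_counter()
for i, (inp, lab) in enumerate(train_loader):
    t1 = time.perf_counter()
    data_time += t1 - t0                  # time spent waiting on the DataLoader

    inputs = inp.to(device)
    labels = lab.view(-1, 1, 1, 1).expand(-1, 1, 224, 224).float().to(device)

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
    if device == "cuda":
        torch.cuda.synchronize()          # wait for the GPU so the timing is meaningful

    t0 = time.perf_counter()
    compute_time += t0 - t1               # transfer + forward/backward time

print(f"data loading: {data_time:.1f}s, compute: {compute_time:.1f}s")

If data_time dominates, the loaders (augmentation included) are the bottleneck, not the model.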

(2) To state the obvious: one epoch takes a tenth of the time if the validation data is only a tenth of the size of the training data.