I created a model for a regression task in PyTorch and want to use Horovod to distribute each batch of data during training. I'm currently testing it with only 1 GPU on Google Colab. The problem is that during training the model parameters never change and the loss stays the same.
This is how I get the data:
train_sampler = torch.utils.data.distributed.DistributedSampler(train_ds, num_replicas=hvd.size(), rank=hvd.rank())
train_dl_sampled = DataLoader(train_ds, batch_size=batch_size, sampler=train_sampler)
train_ds is a Dataset built from a pandas DataFrame with a features column (a NumPy array of 4860 elements per row) and a labels column (a NumPy array of 8 elements per row); see the edit at the end for how it is created.
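For illustration, a DataFrame with the same layout could be built like this (random dummy data; the column names and sizes are taken from the description above, this is not my actual data):

import numpy as np
import pandas as pd

# dummy frame with the same layout as train_df: one row per sample,
# 'features' holds a 4860-element array, 'labels' an 8-element array
train_df_example = pd.DataFrame({
    'features': [np.random.rand(4860).astype(np.float32) for _ in range(1024)],
    'labels': [np.random.rand(8).astype(np.float32) for _ in range(1024)],
})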
Then I set all the important variables and functions:
device = torch.device('cuda', hvd.local_rank()) if torch.cuda.is_available() else torch.device('cpu')
hvd_model = MLP(d_in=4860, n_layers=4, d_layers=256, dropout=0.0, d_out=8, categories=None, d_embedding=128, regression=True, categorical_indicator=None)
hvd_model = hvd_model.to(device)
batch_size = 512
learning_rate = 0.01
learning_rate *= hvd.size()
loss_fn = nn.MSELoss()
optimizer = AdamW(model.parameters(), lr=learning_rate)
epochs=50
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
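(Horovod itself is initialized earlier in the notebook, roughly with the standard calls below; shown only for completeness in case it matters:)

import torch
import horovod.torch as hvd

hvd.init()  # after this, hvd.size(), hvd.rank() and hvd.local_rank() are available
if torch.cuda.is_available():
    # pin this process to its local GPU
    torch.cuda.set_device(hvd.local_rank())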
Then I train it with this code:
hvd_preds = {}
for epoch in range(epochs):
    hvd_model.train()
    total_loss = 0
    for i, (features, labels) in enumerate(train_dl_sampled):
        features = features.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        y_pred = hvd_model(features)
        loss = loss_fn(labels, y_pred)
        loss.backward()
        optimizer.synchronize()
        optimizer.step()
        total_loss += loss.item()
        hvd_preds[epoch] = {'pred': y_pred, 'total_loss': total_loss, 'loss': loss, 'features': features}
    avg_loss = hvd.allreduce(torch.tensor(total_loss), name='average_loss')
    print(f'Epoch {epoch+1}, Average Loss: {total_loss / len(train_dl_sampled):.4f} total loss: {total_loss:.4f}, average loss hvd: {avg_loss:.4f}')
The problem is that the print statement returns the same result for every epoch (the numbers differ between runs, but stay constant across epochs within a run):
Epoch 1, Average Loss: 0.2934 total loss: 4.9882, average loss hvd: 4.9882
Epoch 2, Average Loss: 0.2934 total loss: 4.9882, average loss hvd: 4.9882
Epoch 3, Average Loss: 0.2934 total loss: 4.9882, average loss hvd: 4.9882
Training the same model without any Horovod code works fine.
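To show what I mean by "the parameters never change": comparing a snapshot of the weights before and after an epoch shows no difference. Roughly this kind of check (illustrative only, not part of the training script above):

import copy

# snapshot the weights, run one epoch of the loop above, then compare
before = copy.deepcopy(hvd_model.state_dict())
# ... run one epoch of the training loop ...
after = hvd_model.state_dict()
changed = any(not torch.equal(before[k], after[k]) for k in before)
print('any parameter changed:', changed)  # stays False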
Edit: The data used in the first step is created like this:
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Get a single row from the DataFrame and convert to PyTorch tensors
        row = self.data.iloc[idx]
        features = torch.tensor(row['features'], dtype=torch.float32)
        labels = torch.tensor(row['labels'], dtype=torch.float32)
        return features, labels

train_ds = CustomDataset(train_df)
test_ds = CustomDataset(test_df)

batch_size = 512
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=False)
test_dl = DataLoader(test_ds, batch_size=batch_size, shuffle=False)
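For completeness, this is roughly how a single batch coming out of train_dl_sampled can be inspected (just a sanity check, not part of the training code):

# pull one batch from the distributed DataLoader and look at shapes/dtypes
features, labels = next(iter(train_dl_sampled))
print(features.shape, features.dtype)  # expecting torch.Size([512, 4860]) torch.float32
print(labels.shape, labels.dtype)      # expecting torch.Size([512, 8]) torch.float32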
Could it be a problem that I'm using my own implementation of the Dataset and something is missing?