I created a model for a regression task in PyTorch and want to use Horovod to distribute each batch of data during training. I'm currently testing it with only 1 GPU on Google Colab. The problem is that during training the model parameters never change and the loss stays the same.
This is how I get the data:
train_sampler = torch.utils.data.distributed.DistributedSampler(train_ds, num_replicas=hvd.size(), rank=hvd.rank())
train_dl_sampled = DataLoader(train_ds, batch_size=batch_size, sampler=train_sampler)
train_ds is a Dataset built from a pandas DataFrame with a features column (a NumPy array of 4860 elements per row) and a labels column (a NumPy array of 8 elements per row); see the edit at the end for how it is created.
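For illustration, a DataFrame with the same layout could be built like this (random dummy data; the column names and sizes are taken from the description above, this is not my actual data):

import numpy as np
import pandas as pd

# dummy frame with the same layout as train_df: one row per sample,
# 'features' holds a 4860-element array, 'labels' an 8-element array
train_df_example = pd.DataFrame({
    'features': [np.random.rand(4860).astype(np.float32) for _ in range(1024)],
    'labels': [np.random.rand(8).astype(np.float32) for _ in range(1024)],
})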
Then I set all the important variables and functions:
device = torch.device('cuda', hvd.local_rank()) if torch.cuda.is_available() else torch.device('cpu')
hvd_model = MLP(d_in=4860, n_layers=4, d_layers=256, dropout=0.0, d_out=8, categories=None, d_embedding=128, regression=True, categorical_indicator=None)
hvd_model = hvd_model.to(device)
batch_size = 512
learning_rate = 0.01
learning_rate *= hvd.size()
loss_fn = nn.MSELoss()
optimizer = AdamW(model.parameters(), lr=learning_rate)
epochs=50
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
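(Horovod itself is initialized earlier in the notebook, roughly with the standard calls below; shown only for completeness in case it matters:)

import torch
import horovod.torch as hvd

hvd.init()  # after this, hvd.size(), hvd.rank() and hvd.local_rank() are available
if torch.cuda.is_available():
    # pin this process to its local GPU
    torch.cuda.set_device(hvd.local_rank())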
Then I train it with this code:
hvd_preds = {}
for epoch in range(epochs):
    hvd_model.train()
    total_loss = 0
    for i, (features, labels) in enumerate(train_dl_sampled):
        features = features.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        y_pred = hvd_model(features)
        loss = loss_fn(labels, y_pred)
        loss.backward()
        optimizer.synchronize()
        optimizer.step()
        total_loss += loss.item()
        hvd_preds[epoch] = {'pred': y_pred, 'total_loss': total_loss, 'loss': loss, 'features': features}
    avg_loss = hvd.allreduce(torch.tensor(total_loss), name='average_loss')
    print(f'Epoch {epoch+1}, Average Loss: {total_loss / len(train_dl_sampled):.4f} total loss: {total_loss:.4f}, average loss hvd: {avg_loss:.4f}')
The problem is that the print statement returns the same result for every epoch (the numbers differ between runs, but stay constant across epochs within a run):
Epoch 1, Average Loss: 0.2934 total loss: 4.9882, average loss hvd: 4.9882
Epoch 2, Average Loss: 0.2934 total loss: 4.9882, average loss hvd: 4.9882
Epoch 3, Average Loss: 0.2934 total loss: 4.9882, average loss hvd: 4.9882
Training the same model without any Horovod code works fine.
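To show what I mean by "the parameters never change": comparing a snapshot of the weights before and after an epoch shows no difference. Roughly this kind of check (illustrative only, not part of the training script above):

import copy

# snapshot the weights, run one epoch of the loop above, then compare
before = copy.deepcopy(hvd_model.state_dict())
# ... run one epoch of the training loop ...
after = hvd_model.state_dict()
changed = any(not torch.equal(before[k], after[k]) for k in before)
print('any parameter changed:', changed)  # stays False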
Edit: The data used in the first step is created like this:
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Get a single row from the DataFrame and convert to PyTorch tensors
        row = self.data.iloc[idx]
        features = torch.tensor(row['features'], dtype=torch.float32)
        labels = torch.tensor(row['labels'], dtype=torch.float32)
        return features, labels

train_ds = CustomDataset(train_df)
test_ds = CustomDataset(test_df)

batch_size = 512
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=False)
test_dl = DataLoader(test_ds, batch_size=batch_size, shuffle=False)
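For completeness, this is roughly how a single batch coming out of train_dl_sampled can be inspected (just a sanity check, not part of the training code):

# pull one batch from the distributed DataLoader and look at shapes/dtypes
features, labels = next(iter(train_dl_sampled))
print(features.shape, features.dtype)  # expecting torch.Size([512, 4860]) torch.float32
print(labels.shape, labels.dtype)      # expecting torch.Size([512, 8]) torch.float32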
Could it be a problem that I'm using my own implementation of the Dataset and something is missing?