How to use balanced sampler for torch Dataset/Dataloader


My simplified Dataset looks like:

from typing import List

import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self) -> None:
        super().__init__()
        self.images: torch.Tensor      # shape (n, w, h, c), n images kept in memory - specific use case
        self.labels: torch.Tensor      # shape (n, w, h, c), matching label maps kept in memory
        self.positive_idx: List[int]   # indices of positives: roughly 1 per 10000 negatives
        self.negative_idx: List[int]   # indices of negatives

    def __len__(self):
        return 10000  # fixed value for training

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]
    

ds = MyDataset()
dl = DataLoader(ds, batch_size=100, shuffle=False, sampler=...)
# Weighted sampler? shuffle=False because I assume the sampler takes care of the shuffling.

What is the most "torch" way to balance the sampling for the DataLoader so that each batch is built from 10 positives + 90 random negatives in every epoch, duplicating the available positives when there are not enough of them?

For the purpose of this exercise I'm not implementing augmentation to increase the number of positive samples.
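For reference, here is a minimal sketch of the weighted-sampler alternative I'm alluding to above, using torch.utils.data.WeightedRandomSampler (assuming positive_idx / negative_idx are already populated and the class weights below); as far as I understand it only hits the 10/90 split in expectation rather than guaranteeing exactly 10 positives per batch:

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

ds = MyDataset()

# Split the sampling mass: positives get 10%, negatives get 90% (assumed weights).
weights = torch.zeros(len(ds))
weights[ds.positive_idx] = 0.10 / max(len(ds.positive_idx), 1)
weights[ds.negative_idx] = 0.90 / len(ds.negative_idx)

# replacement=True lets the few positives be drawn (duplicated) multiple times.
sampler = WeightedRandomSampler(weights, num_samples=len(ds), replacement=True)
dl = DataLoader(ds, batch_size=100, sampler=sampler)  # shuffle must stay unset when a sampler is given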

1 Answer

Answered by jupyter:

I think you can implement a batch sampler that chooses which data points are yielded, i.e. which indices get passed to your dataset's __getitem__:

import random


class NegativeSampler:

    def __init__(self, positive_idx, negative_idx, n_batches, n_pos=10, n_neg_per_pos=9):
        self.positive_idx = positive_idx
        self.negative_idx = negative_idx
        self.n_batches = n_batches            # batches that make up one epoch
        self.n_pos = n_pos                    # positives per batch
        self.n_neg_per_pos = n_neg_per_pos    # negatives drawn per positive: 10 * (1 + 9) = 100

    def __len__(self):
        return self.n_batches

    def __iter__(self):
        # Each yielded list of indices becomes one batch; the DataLoader feeds
        # every index to your custom dataset's __getitem__(self, idx).
        for _ in range(self.n_batches):
            # random.choices samples with replacement, so the few positives are
            # duplicated automatically when there are fewer than n_pos of them.
            positive_idx_batch = random.choices(self.positive_idx, k=self.n_pos)
            negative_idx_batch = []

            for _pos_idx in positive_idx_batch:
                negative_idx_batch += random.sample(self.negative_idx, self.n_neg_per_pos)

            yield positive_idx_batch + negative_idx_batch
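
A sketch of how this would plug into the DataLoader (the constructor parameters come from the version above; n_batches=100 is only an assumption so that 100 batches of 100 samples match the 10000-sample epoch). It goes through batch_sampler=, in which case batch_size, shuffle and sampler have to stay at their defaults because the sampler already defines each batch:

ds = MyDataset()
sampler = NegativeSampler(ds.positive_idx, ds.negative_idx, n_batches=100)

# batch_size, shuffle and sampler must be left unset when batch_sampler is given.
dl = DataLoader(ds, batch_sampler=sampler)

for images, labels in dl:
    ...  # each batch holds 10 positive and 90 negative samples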