Handling large PyTorch datasets in Google Colab


This is the first time I am loading a large-scale dataset in Google Colab. For now, I have made a one-time purchase of 100 compute units and want to train a PyTorch model. I keep running into a sequence of errors, all related to Google Colab's inability to handle large datasets.

Dataset

Let me first outline the data I am working with.

  • approx. 85,000 data points (total size: ~25 GB)
  • each data point is represented by a directory containing an image file and a label file
  • stored in my personal Google Drive under /MyDrive/repository_name/data/primus/ (see the illustrative layout below)
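
For reference, the layout looks roughly like this (the sample ID is taken from one of the error messages below; the file names inside the directory are only illustrative, what matters are the .png/.semantic extensions):

/MyDrive/repository_name/data/primus/
    230005372-1_2_1/
        230005372-1_2_1.png        <- image
        230005372-1_2_1.semantic   <- labels
    ... (approx. 85,000 such sample directories)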

How I currently handle the data

1) Mount Google Drive

from google.colab import drive
import os

# path to the repository root on the mounted drive
gdrive_path = '/content/gdrive/MyDrive/repository_name/'
drive.mount('/content/gdrive', force_remount=True)
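
After mounting, it can help to verify that the repository path actually resolves before touching the large data directory (a minimal sanity check, not part of my original code):

# quick check that the mount succeeded and the repository path exists
assert os.path.isdir(gdrive_path), f'path not found after mount: {gdrive_path}'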

2) Load PyTorch Dataset

For data loading, I have implemented a custom PyTorch Dataset:

from torch.utils import data

class PrimusDataset(data.Dataset):
    def __init__(self, data_path, vocabulary_path, transform=None):
        self.data_path = data_path
        self.transform = transform

        # list of tuples containing image and label file paths
        self.data = []

        # some transforms and data handling

        # iterate through each subdirectory (corresponding to a sample)
        for sample_dir in os.listdir(data_path):

            sample_dir_path = os.path.join(data_path, sample_dir)

            image_file = None
            semantic_file = None

            # the .png file contains the image, the .semantic file contains the labels
            for file in os.listdir(sample_dir_path):
                if file.endswith(".png"):
                    image_file = os.path.join(sample_dir_path, file)
                elif file.endswith(".semantic"):
                    semantic_file = os.path.join(sample_dir_path, file)

            # some exception handling

This is all you need to know to understand the exceptions that occur. I instantiate the Dataset class as follows:

data_path = os.path.join(gdrive_path, 'data', 'primus')
vocabulary_path = os.path.join(gdrive_path, 'data', 'semantic_labels.txt')
dataset = PrimusDataset(data_path=data_path, vocabulary_path=vocabulary_path)
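
For completeness, during training the dataset would be consumed through a standard DataLoader (a sketch; batch_size and num_workers are placeholder values, and it assumes __getitem__ returns an (image, label) pair):

from torch.utils.data import DataLoader

# wrap the dataset for batched, shuffled loading during training
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=2)

for images, labels in loader:
    ...  # training step goes here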

Errors

  1. OSError: [Errno 5] Input/output error: '/content/gdrive/MyDrive/repository_name/data/primus'
    This error seems to be related to some I/O quota imposed by Google Colab. However, I can work around it by repeatedly re-running the notebook cell (because of some caching behavior in Google Colab); see the retry sketch after this list.
  2. ConnectionAbortedError: [Errno 103] Software caused connection abort: '/content/gdrive/MyDrive/repository_name/data/primus/230005372-1_2_1'
    Once the above error no longer occurs, this new one appears. Tracking the system RAM usage, it always happens right before RAM usage hits its limit, so I suspect it is related to RAM limitations.
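
For reference, the manual "re-run the cell" workaround for error 1 can be written as a simple retry loop (a minimal sketch; the retry count and delay are arbitrary, and it does nothing about error 2):

import time

def listdir_with_retry(path, retries=5, delay=10):
    # transient Errno 5 errors on the mounted drive sometimes clear after waiting a bit
    for _ in range(retries):
        try:
            return os.listdir(path)
        except OSError as err:
            print(f'os.listdir failed ({err}), retrying in {delay}s ...')
            time.sleep(delay)
    # last attempt; let the error propagate if it still fails
    return os.listdir(path)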

My Question

What is a clean way to use a large dataset (i.e. many files) that is stored in Google Drive without encountering I/O errors and RAM limitations?
