This is the first time I am loading a large-scale dataset in Google Colab. For now, I have made a one-time purchase of 100 compute units and want to train a PyTorch model. I am running into a sequence of errors, all related to Google Colab's inability to handle large datasets.
Dataset
Let me first outline the data I am working with.
- approx. 85,000 data points (total size: ~25 GB)
- each data point is represented by a directory (containing image file + label file)
- stored in my personal Google Drive storage under
/MyDrive/repository_name/data/primus/
How I currently handle the data
1) Mount Google Drive
from google.colab import drive
import os
gdrive_path='/content/gdrive/MyDrive/repository_name/'
drive.mount('/content/gdrive', force_remount=True)
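Mounting itself succeeds; a quick sanity check like the following (purely illustrative) confirms the path is visible:
print(os.path.exists(gdrive_path))  # prints True once Drive is mounted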
2) Load PyTorch Dataset
For data loading, I have implemented a custom PyTorch Dataset:
from torch.utils import data

class PrimusDataset(data.Dataset):
    def __init__(self, data_path, vocabulary_path, transform=None):
        self.data_path = data_path
        self.transform = transform
        # list of tuples containing image and label file paths
        self.data = []
        # some transforms and data handling
        # iterate through each subdirectory (corresponding to a sample)
        for sample_dir in os.listdir(data_path):
            sample_dir_path = os.path.join(data_path, sample_dir)
            image_file = None
            semantic_file = None
            # .png file contains the image, .semantic file contains the labels
            for file in os.listdir(sample_dir_path):
                if file.endswith(".png"):
                    image_file = os.path.join(sample_dir_path, file)
                elif file.endswith(".semantic"):
                    semantic_file = os.path.join(sample_dir_path, file)
            # some exception handling
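The rest of the class (__len__ and __getitem__) is omitted above; it follows the usual lazy-loading pattern, roughly like this simplified sketch (PIL is assumed here for image loading; the actual transform and error handling differ):

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # load a single sample from Drive only when it is requested
        image_path, semantic_path = self.data[index]
        image = Image.open(image_path)  # assumes: from PIL import Image
        with open(semantic_path) as f:
            label = f.read().strip()
        if self.transform:
            image = self.transform(image)
        return image, label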
This is all you need to know to understand the exceptions that occur. I instantiate the dataset as follows:
data_path = os.path.join(gdrive_path, 'data', 'primus')
vocabulary_path = os.path.join(gdrive_path, 'data', 'semantic_labels.txt')
dataset = PrimusDataset(data_path=data_path, vocabulary_path=vocabulary_path)
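For training, the dataset would then be consumed through a standard DataLoader; the exact settings are not final, but roughly along these lines (assumes the transform converts images to tensors):

from torch.utils.data import DataLoader

# illustrative settings only; batch size and worker count are not fixed yet
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)

for images, labels in loader:
    pass  # training step goes here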
Errors
OSError: [Errno 5] Input/output error: '/content/gdrive/MyDrive/repository_name/data/primus'
This error seems to be related to some quota on I/O operations imposed by Google Colab. However, I can get past it by repeatedly re-running the notebook cell (presumably because of some caching behavior in Google Colab).
ConnectionAbortedError: [Errno 103] Software caused connection abort: '/content/gdrive/MyDrive/repository_name/data/primus/230005372-1_2_1'
Once the above error no longer occurs, this new error appears instead. Watching the system RAM usage, I see that it always occurs right before RAM is fully used, so I suspect it is somehow related to RAM limitations.
My Question
What is a clean way to use a large dataset (i.e., one consisting of many files) stored in Google Drive without running into I/O errors and RAM limitations?