Chapter 5: Data Handling with torch.utils.data in PyTorch

Abstract:

PyTorch's torch.utils.data module provides essential tools for efficient and organized data handling, primarily through the Dataset and DataLoader classes. These abstractions streamline the process of loading, preprocessing, and feeding data into a model, especially for large or complex datasets. 
1. torch.utils.data.Dataset:
  • Purpose: This is an abstract class that represents a dataset. You typically create a custom dataset by subclassing Dataset and implementing two key methods:
    • __len__(self): Returns the total number of samples in the dataset.
    • __getitem__(self, idx): Retrieves a single sample and its corresponding label (or other target information) at the given index idx. This is where you would load data from disk, apply transformations, and prepare it for your model.
  • Example:
    from PIL import Image
    from torch.utils.data import Dataset

    class CustomImageDataset(Dataset):
        def __init__(self, image_paths, labels, transform=None):
            self.image_paths = image_paths
            self.labels = labels
            self.transform = transform

        def __len__(self):
            return len(self.image_paths)

        def __getitem__(self, idx):
            # Load the image at the given index from disk
            image = Image.open(self.image_paths[idx]).convert("RGB")
            # Apply the transform if one was provided
            if self.transform:
                image = self.transform(image)
            # Return the image together with its label
            return image, self.labels[idx]
2. torch.utils.data.DataLoader:
  • Purpose: This class wraps a Dataset in an iterable that supports batching, shuffling, and multi-process data loading. It acts as a "delivery truck" that fetches samples from your Dataset and delivers them to your model in batches during training or evaluation.
  • Key Arguments:
    • dataset: An instance of your Dataset (or a built-in PyTorch dataset).
    • batch_size: The number of samples to load in each batch.
    • shuffle: Boolean indicating whether to shuffle the data order at the beginning of each epoch.
    • num_workers: Number of subprocesses to use for data loading. The default, 0, loads data in the main process. Larger values can significantly speed up data loading, especially with large datasets and complex preprocessing.
    • collate_fn: An optional function that specifies how to combine individual samples into a batch. PyTorch provides a default collate function that works well for many common data types.
  • Example:
    from torch.utils.data import DataLoader

    # Assuming 'my_dataset' is an instance of CustomImageDataset
    data_loader = DataLoader(
        my_dataset,
        batch_size=32,
        shuffle=True,
        num_workers=4
    )

    for batch_idx, (images, labels) in enumerate(data_loader):
        # Your training or evaluation logic here
        pass
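The collate_fn argument is easiest to understand with samples that the default collate function cannot stack, such as sequences of different lengths. The following is a minimal sketch under that assumption; the toy data and the helper name pad_collate are hypothetical and chosen only for illustration.

```python
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# Hypothetical toy data: three sequences of different lengths with labels.
# A plain list works as a dataset because it supports len() and indexing.
samples = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4, 5]), 1),
    (torch.tensor([6]), 0),
]


def pad_collate(batch):
    """Pad sequences to the longest one in the batch and stack the labels."""
    sequences, labels = zip(*batch)
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, torch.tensor(labels)


seq_loader = DataLoader(samples, batch_size=3, collate_fn=pad_collate)
padded_batch, label_batch = next(iter(seq_loader))
print(padded_batch.shape)     # torch.Size([3, 3])
print(label_batch.tolist())   # [0, 1, 0]
```

Padding inside the collate function keeps each batch only as wide as its longest sequence, which is one common design choice; other projects may prefer to pad to a fixed length in the Dataset itself.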
Benefits of using Dataset and DataLoader:
  • Abstraction: Separates data loading and preprocessing logic from model training logic.
  • Efficiency: Enables efficient batching, shuffling, and parallel data loading.
  • Scalability: Handles large datasets that may not fit entirely in memory.
  • Flexibility: Allows for custom data handling and transformations tailored to specific project needs.






Learning Objectives

After completing this chapter, you will be able to:

  • Understand the importance of data handling in deep learning workflows.

  • Use PyTorch’s Dataset and DataLoader classes effectively.

  • Create custom datasets for various types of data (e.g., images, text, CSV).

  • Apply transformations and preprocessing techniques for better model performance.

  • Implement batch loading and data shuffling for efficient training.


5.1 Introduction

In deep learning, data is the foundation upon which all models are built. Handling data efficiently — loading, transforming, and batching — is crucial for model performance and training speed. PyTorch provides a robust and flexible framework for managing datasets through the torch.utils.data module.

This module introduces two core abstractions:

  1. Dataset — represents a collection of data samples and their labels.

  2. DataLoader — provides an efficient way to load data in batches, with options for shuffling and parallel processing.

Together, they make it easy to handle small or large datasets, whether they reside locally or online.


5.2 The Dataset Class

A Dataset in PyTorch is an abstract class representing a collection of data samples. A custom dataset subclasses it and must implement two methods:

  1. __len__() — returns the total number of samples in the dataset.

  2. __getitem__(index) — retrieves a sample at a given index.
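As a minimal sketch of these two methods, the following defines a toy in-memory dataset. The class name SquaresDataset and its contents are assumptions made for illustration; they are not part of PyTorch.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i squared)."""

    def __init__(self, n):
        self.x = torch.arange(n, dtype=torch.float32)
        self.y = self.x ** 2

    def __len__(self):
        # Called by len(dataset)
        return len(self.x)

    def __getitem__(self, idx):
        # Called by dataset[idx]
        return self.x[idx], self.y[idx]


squares = SquaresDataset(5)
print(len(squares))   # 5
print(squares[3])     # (tensor(3.), tensor(9.))
```

Because the class supports len() and integer indexing, it can be passed to a DataLoader like any built-in dataset.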

Example: Using a Built-in Dataset

PyTorch provides several built-in datasets through the torchvision.datasets module. Let’s look at an example using MNIST, a popular dataset of handwritten digits.

from torchvision import datasets, transforms

# Define a transformation
transform = transforms.ToTensor()

# Load the MNIST dataset
train_dataset = datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

# Get length and first item
print(len(train_dataset))
image, label = train_dataset[0]
print(image.shape, label)

Explanation:

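  • With download=True, the dataset is downloaded to root if it is not already there; train=True selects the 60,000-image training split.

  • transforms.ToTensor() converts each image to a float tensor with values scaled to [0, 1], so image.shape is torch.Size([1, 28, 28]) and label is an integer class index.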


5.3 The DataLoader Class

The DataLoader is a powerful wrapper around the Dataset that helps with:

  • Batching: Loading multiple samples at once.

  • Shuffling: Randomizing the order of samples each epoch so that the model does not pick up patterns from the order in which the data is stored.

  • Parallel Loading: Using multiple workers for faster data retrieval.

Example: Loading Data in Batches

from torch.utils.data import DataLoader

# Create a DataLoader
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=64,
    shuffle=True
)

# Iterate over DataLoader
for batch_idx, (images, labels) in enumerate(train_loader):
    print(f"Batch {batch_idx+1}:")
    print(f"Images shape: {images.shape}")
    print(f"Labels shape: {labels.shape}")
    break

Explanation:

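  • batch_size=64 loads 64 samples at a time, so images has shape [64, 1, 28, 28] and labels has shape [64]. The final batch of an epoch may be smaller when the dataset size is not a multiple of the batch size.

  • shuffle=True reshuffles the data at the start of every epoch.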

  • The loop yields batches of (images, labels) ready for training.
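To see how batch_size determines the number and size of batches without downloading anything, here is a small sketch using TensorDataset. The toy tensors and variable names are assumptions for illustration; drop_last is a standard DataLoader argument that discards an incomplete final batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data (an assumption for illustration): 10 samples with 3 features each
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
targets = torch.arange(10)
toy_dataset = TensorDataset(features, targets)

# 10 samples with batch_size=4 give batches of 4, 4, and 2
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)
batch_sizes = [len(y) for _, y in toy_loader]
print(batch_sizes)        # [4, 4, 2]
print(len(toy_loader))    # 3 batches per epoch

# drop_last=True discards the final incomplete batch
full_only = DataLoader(toy_dataset, batch_size=4, drop_last=True)
print(len(full_only))     # 2
```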


5.4 Creating Custom Datasets

Sometimes, built-in datasets are not enough — for example, when working with your own CSV files, text data, or images stored in folders. In such cases, we can create a custom dataset by subclassing torch.utils.data.Dataset.

Example: Custom Dataset from CSV File

Let’s say we have a CSV file containing features and labels:

feature1,feature2,label
0.5,1.2,0
0.9,0.8,1

We can create a custom dataset to read this file:

import torch
from torch.utils.data import Dataset
import pandas as pd

class CSVDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)
        self.features = self.data[['feature1', 'feature2']].values
        self.labels = self.data['label'].values
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        X = torch.tensor(self.features[idx], dtype=torch.float32)
        y = torch.tensor(self.labels[idx], dtype=torch.long)
        return X, y

# Example usage
dataset = CSVDataset('data.csv')
print(len(dataset))
print(dataset[0])

This approach provides complete flexibility for reading data from custom formats.


5.5 Data Preprocessing and Transformations

Preprocessing helps prepare data before feeding it into the neural network. PyTorch offers a set of transformation utilities under torchvision.transforms, especially for image data.

Common Transformations

Transformation                       Description
transforms.ToTensor()                Converts image to PyTorch tensor
transforms.Normalize(mean, std)      Normalizes pixel values
transforms.Resize(size)              Resizes the image
transforms.RandomHorizontalFlip()    Randomly flips the image horizontally
transforms.Compose()                 Chains multiple transformations

Example: Using Compose for Preprocessing

from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5,), std=(0.5,))
])
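To make the Normalize step concrete: for each channel it computes (x - mean) / std. The sketch below reproduces that arithmetic with plain tensor operations instead of calling torchvision, using hypothetical pixel values that ToTensor would already have scaled to [0, 1].

```python
import torch

# Hypothetical pixel values after ToTensor(), which scales to [0, 1]
pixels = torch.tensor([0.0, 0.5, 1.0])

mean, std = 0.5, 0.5
normalized = (pixels - mean) / std  # same formula Normalize applies per channel
print(normalized.tolist())  # [-1.0, 0.0, 1.0]
```

With mean and std both set to 0.5, values in [0, 1] are mapped to [-1, 1], which centers the inputs around zero.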

You can then apply this transform while loading a dataset:

from torchvision import datasets

train_dataset = datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

Key Point:
Transforms are applied automatically when each item is accessed via __getitem__() in the dataset.
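The following sketch illustrates this lazy behavior with a small stand-in dataset. TransformedDataset, double, and the calls list are hypothetical names introduced here; the pattern mirrors how a transform argument is typically used, but it is not the torchvision implementation.

```python
import torch
from torch.utils.data import Dataset


class TransformedDataset(Dataset):
    """Hypothetical wrapper that applies `transform` on each access."""

    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        if self.transform:
            sample = self.transform(sample)
        return sample


calls = []


def double(x):
    calls.append(float(x))   # record every invocation
    return x * 2


lazy = TransformedDataset(torch.tensor([1.0, 2.0, 3.0]), transform=double)
print(len(calls))   # 0: nothing has been transformed yet
print(lazy[1])      # tensor(4.)
print(len(calls))   # 1: the transform ran once, for the item just accessed
```

Because the transform runs on every access, random transforms such as RandomHorizontalFlip can produce a different result each time the same index is requested.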


5.6 Batch Loading and Shuffling

Batch Loading

Batch loading improves computational efficiency by processing multiple samples together instead of one at a time.

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32)
for X_batch, y_batch in loader:
    print(X_batch.shape, y_batch.shape)

Shuffling

Shuffling ensures that the model does not learn any unintended order in the data. It’s especially important for training data.

train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

Note:

  • For training data, set shuffle=True.

  • For validation/test data, set shuffle=False.
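The difference between the two settings can be checked on a toy dataset. In this sketch the list numbers and the helper epoch_order are assumptions for illustration; the generator argument of DataLoader is used to make the shuffled order repeatable.

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical toy dataset: the integers 0..7 (a list works as a dataset)
numbers = list(range(8))


def epoch_order(loader):
    """Collect the sample order produced by one pass over the loader."""
    return [int(v) for batch in loader for v in batch]


ordered = DataLoader(numbers, batch_size=4, shuffle=False)
print(epoch_order(ordered))  # [0, 1, 2, 3, 4, 5, 6, 7]

# Seeding a generator makes the shuffled order repeatable across runs
g1 = torch.Generator().manual_seed(0)
g2 = torch.Generator().manual_seed(0)
run_a = epoch_order(DataLoader(numbers, batch_size=4, shuffle=True, generator=g1))
run_b = epoch_order(DataLoader(numbers, batch_size=4, shuffle=True, generator=g2))
print(run_a == run_b)            # True
print(sorted(run_a) == numbers)  # True: every sample appears exactly once
```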


5.7 Practical Example: Custom Image Dataset

Let’s build a complete example of a custom image dataset stored in directories.

Folder structure:

data/
    cats/
        cat1.jpg
        cat2.jpg
    dogs/
        dog1.jpg
        dog2.jpg

Code:

from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
import os

class CustomImageDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        self.root_dir = root_dir
        # Sort so that label indices are assigned in a stable, alphabetical order
        self.classes = sorted(os.listdir(root_dir))
        self.transform = transform
        self.image_paths = []
        self.labels = []

        for label, class_name in enumerate(self.classes):
            class_path = os.path.join(root_dir, class_name)
            for img_name in os.listdir(class_path):
                self.image_paths.append(os.path.join(class_path, img_name))
                self.labels.append(label)

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = Image.open(self.image_paths[idx]).convert("RGB")
        label = self.labels[idx]
        if self.transform:
            img = self.transform(img)
        return img, label

# Define transformations
transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor()
])

# Initialize dataset and dataloader
dataset = CustomImageDataset('data', transform=transform)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Iterate
for images, labels in loader:
    print(images.shape, labels)
    break

This setup allows you to easily load and preprocess custom image datasets with minimal code.


5.8 Summary

  • The Dataset class defines how data is accessed.

  • The DataLoader class handles efficient data loading, batching, and shuffling.

  • Custom datasets enable flexibility for unique data sources.

  • Transformations are used for preprocessing and augmentation.

  • Batch loading and shuffling improve training performance and model generalization.


5.9 Exercises

  1. Conceptual Questions:

    • What are the main purposes of the Dataset and DataLoader classes?

    • Why is data shuffling important in training neural networks?

    • What is the role of transformations in preprocessing?

  2. Coding Tasks:

    • Create a custom dataset that loads tabular data from a CSV file and applies normalization.

    • Use DataLoader to load batches of data from your custom dataset.

    • Implement a transformation pipeline that includes resizing, normalization, and random flipping for images.

  3. Challenge:

    • Design a dataset class for text data (e.g., reading sentences and labels from a file) and load it using a DataLoader.

