Chapter 5: Data Handling with torch.utils.data in PyTorch
Abstract:
The torch.utils.data module provides essential tools for efficient and organized data handling, primarily through the Dataset and DataLoader classes. These abstractions streamline loading, preprocessing, and feeding data into a model, especially for large or complex datasets.

torch.utils.data.Dataset:
- Purpose: an abstract class that represents a dataset. You typically create a custom dataset by subclassing Dataset and implementing two key methods:
  - __len__(self): returns the total number of samples in the dataset.
  - __getitem__(self, idx): retrieves a single sample and its corresponding label (or other target information) at the given index idx. This is where you load data from disk, apply transformations, and prepare the sample for your model.
- Example:

import torch
from torch.utils.data import Dataset
from PIL import Image

class CustomImageDataset(Dataset):
    def __init__(self, image_paths, labels, transform=None):
        self.image_paths = image_paths
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Load the image from disk
        image = Image.open(self.image_paths[idx]).convert("RGB")
        # Apply the transform if one was provided
        if self.transform:
            image = self.transform(image)
        # Return the image and its label
        return image, self.labels[idx]

torch.utils.data.DataLoader:
- Purpose: wraps an iterable around a Dataset to enable efficient batching, shuffling, and multi-process data loading. It acts as a "delivery truck" that fetches data from your Dataset and prepares it for your model during training or evaluation.
- Key Arguments:
  - dataset: an instance of your Dataset (or a built-in PyTorch dataset).
  - batch_size: the number of samples to load in each batch.
  - shuffle: whether to shuffle the data order at the beginning of each epoch.
  - num_workers: the number of subprocesses to use for data loading. This can significantly speed up loading, especially with large datasets and complex preprocessing.
  - collate_fn: an optional function that specifies how to combine individual samples into a batch. PyTorch's default collate function works well for many common data types (a padding example is sketched at the end of this abstract).
- Example:

from torch.utils.data import DataLoader

# Assuming 'my_dataset' is an instance of CustomImageDataset
data_loader = DataLoader(
    my_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)

for batch_idx, (images, labels) in enumerate(data_loader):
    # Your training or evaluation logic goes here
    pass

Benefits of Dataset and DataLoader:
- Abstraction: separates data loading and preprocessing logic from model training logic.
- Efficiency: enables efficient batching, shuffling, and parallel data loading.
- Scalability: handles large datasets that may not fit entirely in memory.
- Flexibility: allows custom data handling and transformations tailored to specific project needs.
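To make the collate_fn argument concrete, here is a sketch of a custom collate function that pads variable-length sequences to a common length within each batch. The pad_collate name and the sequence-style my_dataset are illustrative assumptions, not part of the chapter's running example:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical case: each sample is a (variable-length 1-D tensor, label) pair
def pad_collate(batch):
    sequences, labels = zip(*batch)
    # Pad every sequence in the batch to the length of the longest one
    padded = pad_sequence(sequences, batch_first=True)
    return padded, torch.tensor(labels)

# loader = DataLoader(my_dataset, batch_size=32, collate_fn=pad_collate)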
The complete chapter follows, written in textbook format: learning objectives, explanations, examples, and exercises.
Chapter 5: Data Handling with torch.utils.data
Learning Objectives
After completing this chapter, you will be able to:
- Understand the importance of data handling in deep learning workflows.
- Use PyTorch's Dataset and DataLoader classes effectively.
- Create custom datasets for various types of data (e.g., images, text, CSV).
- Apply transformations and preprocessing techniques for better model performance.
- Implement batch loading and data shuffling for efficient training.
5.1 Introduction
In deep learning, data is the foundation upon which all models are built. Handling data efficiently — loading, transforming, and batching — is crucial for model performance and training speed. PyTorch provides a robust and flexible framework for managing datasets through the torch.utils.data module.
This module introduces two core abstractions:
- Dataset — represents a collection of data samples and their labels.
- DataLoader — provides an efficient way to load data in batches, with options for shuffling and parallel processing.
Together, they make it easy to handle small or large datasets, whether they reside locally or online.
5.2 The Dataset Class
A Dataset in PyTorch is an abstract class representing a collection of data samples. It provides two key methods that must be defined when creating a custom dataset:
- __len__() — returns the total number of samples in the dataset.
- __getitem__(index) — retrieves a sample at a given index.
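Here is a toy sketch of these two methods on a small in-memory dataset (the SquaresDataset name and data are made up for illustration):

import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Toy dataset: each sample is (x, x**2)."""
    def __init__(self, n):
        self.xs = list(range(n))

    def __len__(self):
        # Total number of samples
        return len(self.xs)

    def __getitem__(self, idx):
        x = float(self.xs[idx])
        return torch.tensor([x]), torch.tensor([x ** 2])

ds = SquaresDataset(10)
print(len(ds))   # 10
print(ds[3])     # (tensor([3.]), tensor([9.]))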
Example: Using a Built-in Dataset
PyTorch provides several built-in datasets through the torchvision.datasets module. Let’s look at an example using MNIST, a popular dataset of handwritten digits.
from torchvision import datasets, transforms
# Define a transformation
transform = transforms.ToTensor()
# Load the MNIST dataset
train_dataset = datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)
# Get length and first item
print(len(train_dataset))
image, label = train_dataset[0]
print(image.shape, label)
Explanation:
- The dataset is downloaded automatically to ./data on first use.
- The ToTensor() transform converts each PIL image into a float32 tensor with values scaled to [0, 1], suitable for PyTorch models.
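You can check this scaling directly. Continuing from the MNIST snippet above, the image tensor is float32 with values inside [0, 1]:

# 'image' comes from the MNIST example above
print(image.dtype)                              # torch.float32
print(image.min().item(), image.max().item())   # values lie within [0.0, 1.0]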
5.3 The DataLoader Class
The DataLoader is a powerful wrapper around the Dataset that helps with:
- Batching: loading multiple samples at once.
- Shuffling: randomizing the order of data to prevent model bias.
- Parallel Loading: using multiple workers for faster data retrieval.
Example: Loading Data in Batches
from torch.utils.data import DataLoader
# Create a DataLoader
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=64,
    shuffle=True
)
# Iterate over DataLoader
for batch_idx, (images, labels) in enumerate(train_loader):
    print(f"Batch {batch_idx+1}:")
    print(f"Images shape: {images.shape}")
    print(f"Labels shape: {labels.shape}")
    break
Explanation:
- batch_size=64 loads 64 samples at a time.
- shuffle=True ensures randomization in each epoch.
- The loop yields batches of (images, labels) ready for training.
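In a real training loop, each batch is usually moved to the compute device before the forward pass. A minimal sketch (the model, loss, and optimizer are omitted):

import torch

# Pick a GPU when available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

for images, labels in train_loader:
    images = images.to(device)
    labels = labels.to(device)
    # forward pass, loss computation, and backward pass would go here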
5.4 Creating Custom Datasets
Sometimes, built-in datasets are not enough — for example, when working with your own CSV files, text data, or images stored in folders. In such cases, we can create a custom dataset by subclassing torch.utils.data.Dataset.
Example: Custom Dataset from CSV File
Let’s say we have a CSV file containing features and labels:
| feature1 | feature2 | label | 
|---|---|---|
| 0.5 | 1.2 | 0 | 
| 0.9 | 0.8 | 1 | 
We can create a custom dataset to read this file:
import torch
from torch.utils.data import Dataset
import pandas as pd
class CSVDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)
        self.features = self.data[['feature1', 'feature2']].values
        self.labels = self.data['label'].values
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        X = torch.tensor(self.features[idx], dtype=torch.float32)
        y = torch.tensor(self.labels[idx], dtype=torch.long)
        return X, y
# Example usage
dataset = CSVDataset('data.csv')
print(len(dataset))
print(dataset[0])
This approach provides complete flexibility for reading data from custom formats.
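A custom dataset like this plugs directly into the rest of the torch.utils.data machinery. For instance, torch.utils.data.random_split can carve out a validation set; a sketch, assuming the data.csv file from above:

from torch.utils.data import DataLoader, random_split

dataset = CSVDataset('data.csv')

# Hold out roughly 20% of the samples for validation
n_val = int(0.2 * len(dataset))
train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)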
5.5 Data Preprocessing and Transformations
Preprocessing helps prepare data before feeding it into the neural network. PyTorch offers a set of transformation utilities under torchvision.transforms, especially for image data.
Common Transformations
| Transformation | Description |
|---|---|
| transforms.ToTensor() | Converts an image to a PyTorch tensor |
| transforms.Normalize(mean, std) | Normalizes pixel values |
| transforms.Resize(size) | Resizes the image |
| transforms.RandomHorizontalFlip() | Randomly flips the image horizontally |
| transforms.Compose() | Chains multiple transformations |
Example: Using Compose for Preprocessing
from torchvision import transforms
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5,), std=(0.5,))
])
You can then apply this transform while loading a dataset:
from torchvision import datasets
train_dataset = datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)
Key Point:
Transforms are applied automatically when each item is accessed via __getitem__() in the dataset.
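A quick way to see this in action: with a random augmentation in the pipeline, accessing the same index twice can produce different tensors, because the transform re-runs on every __getitem__() call. A small demonstration, reusing the MNIST dataset:

import torch
from torchvision import datasets, transforms

aug = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor()
])

ds = datasets.MNIST(root='./data', train=True, download=True, transform=aug)

# The flip is re-sampled on each access, so the two tensors
# differ roughly half the time
a, _ = ds[0]
b, _ = ds[0]
print(torch.equal(a, b))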
5.6 Batch Loading and Shuffling
Batch Loading
Batch loading improves computational efficiency by processing multiple samples together instead of one at a time.
from torch.utils.data import DataLoader
loader = DataLoader(dataset, batch_size=32)
for X_batch, y_batch in loader:
    print(X_batch.shape, y_batch.shape)
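If the dataset size is not evenly divisible by batch_size, the final batch is simply smaller. When a fixed batch size is required, the drop_last argument discards that last incomplete batch:

# The last incomplete batch (if any) is silently dropped
loader = DataLoader(dataset, batch_size=32, drop_last=True)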
Shuffling
Shuffling ensures that the model does not learn any unintended order in the data. It’s especially important for training data.
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)
Note:
- For training data, set shuffle=True.
- For validation/test data, set shuffle=False.
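Shuffling draws on PyTorch's random number generator, so passing an explicitly seeded generator makes the shuffle order reproducible across runs; a sketch:

import torch
from torch.utils.data import DataLoader

g = torch.Generator()
g.manual_seed(42)  # fixes the shuffle order across runs

train_loader = DataLoader(dataset, batch_size=32, shuffle=True, generator=g)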
5.7 Practical Example: Custom Image Dataset
Let’s build a complete example of a custom image dataset stored in directories.
Folder structure:
data/
    cats/
        cat1.jpg
        cat2.jpg
    dogs/
        dog1.jpg
        dog2.jpg
Code:
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
import os

class CustomImageDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        self.root_dir = root_dir
        # Sort class names so label indices are deterministic across runs
        self.classes = sorted(os.listdir(root_dir))
        self.transform = transform
        self.image_paths = []
        self.labels = []
        for label, class_name in enumerate(self.classes):
            class_path = os.path.join(root_dir, class_name)
            for img_name in os.listdir(class_path):
                self.image_paths.append(os.path.join(class_path, img_name))
                self.labels.append(label)

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = Image.open(self.image_paths[idx]).convert("RGB")
        label = self.labels[idx]
        if self.transform:
            img = self.transform(img)
        return img, label

# Define transformations
transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor()
])

# Initialize dataset and dataloader
dataset = CustomImageDataset('data', transform=transform)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Iterate over one batch
for images, labels in loader:
    print(images.shape, labels)
    break
This setup allows you to easily load and preprocess custom image datasets with minimal code.
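For exactly this kind of class-per-folder layout, torchvision also ships a ready-made equivalent, datasets.ImageFolder, which infers labels from subdirectory names just as CustomImageDataset does:

from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor()
])

# Labels are assigned from the sorted subdirectory names
dataset = datasets.ImageFolder(root='data', transform=transform)
print(dataset.classes)  # ['cats', 'dogs']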
5.8 Summary
- The Dataset class defines how data is accessed.
- The DataLoader class handles efficient data loading, batching, and shuffling.
- Custom datasets enable flexibility for unique data sources.
- Transformations are used for preprocessing and augmentation.
- Batch loading and shuffling improve training performance and model generalization.
5.9 Exercises
- Conceptual Questions:
  - What are the main purposes of the Dataset and DataLoader classes?
  - Why is data shuffling important in training neural networks?
  - What is the role of transformations in preprocessing?
- Coding Tasks:
  - Create a custom dataset that loads tabular data from a CSV file and applies normalization.
  - Use DataLoader to load batches of data from your custom dataset.
  - Implement a transformation pipeline that includes resizing, normalization, and random flipping for images.
- Challenge:
  - Design a dataset class for text data (e.g., reading sentences and labels from a file) and load it using a DataLoader.