Chapter 8: Convolutional Neural Networks (CNNs) in PyTorch


Abstract:

Convolutional Neural Networks (CNNs) in PyTorch are a fundamental architecture for image processing and computer vision tasks. PyTorch provides robust tools within its torch.nn module to easily define, build, and train CNNs.
Key Components of a CNN in PyTorch:
  • Convolutional Layers (nn.Conv2d): 
    These layers apply a set of learnable filters (kernels) to the input image, extracting features such as edges, textures, or more complex patterns. Key parameters include in_channels, out_channels, kernel_size, stride, and padding.
  • Activation Functions: 
    Non-linear activation functions, commonly ReLU (nn.ReLU), are applied after convolutional layers to introduce non-linearity, enabling the network to learn more complex relationships.
  • Pooling Layers (nn.MaxPool2d, nn.AvgPool2d): 
    These layers reduce the spatial dimensions (width and height) of the feature maps. Pooling has no learnable parameters of its own, but the smaller feature maps lower the computational cost and the parameter count of later layers, and provide a degree of translational invariance. Max pooling is a common choice.
  • Fully Connected Layers (nn.Linear): 
    After several convolutional and pooling layers, the flattened feature maps are typically passed through one or more fully connected layers for classification or regression tasks.
Building a CNN in PyTorch:
To build a CNN in PyTorch, one typically defines a class that inherits from torch.nn.Module. This class will contain:
  • __init__ method: 
    This method defines the layers of the network (e.g., nn.Conv2d, nn.MaxPool2d, nn.Linear).
  • forward method: 
    This method defines the forward pass of the network, specifying how the input data flows through the defined layers and activation functions to produce an output.
Example Structure:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # 3 input channels (RGB), 16 output channels
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(32 * 8 * 8, 128)  # Example: assuming input image size leads to 8x8 after pooling
        self.fc2 = nn.Linear(128, 10)  # 10 output classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 32 * 8 * 8)  # Flatten the feature maps
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

Training a CNN:
Training a CNN in PyTorch involves:
  • Data Preparation: Loading and transforming image datasets using torchvision.datasets and torchvision.transforms, and creating data loaders with torch.utils.data.DataLoader.
  • Loss Function: Defining a suitable loss function (e.g., nn.CrossEntropyLoss for classification).
  • Optimizer: Choosing an optimizer (e.g., torch.optim.Adam, torch.optim.SGD).
  • Training Loop: Iterating through epochs, performing forward and backward passes, and updating model parameters based on the calculated loss and optimizer.
  • Evaluation: Assessing the model's performance on a held-out test set.



Chapter 8: Convolutional Neural Networks (CNNs)

Learning Objectives

After completing this chapter, readers will be able to:

  • Understand the core architecture and principles behind Convolutional Neural Networks (CNNs).

  • Explain the roles of convolution, pooling, and padding operations.

  • Build and train CNN models using PyTorch.

  • Implement image classification tasks using popular datasets such as MNIST or CIFAR-10.

  • Apply transfer learning and fine-tuning for efficient model reuse and improvement.


8.1 Fundamentals of CNNs

8.1.1 Introduction

Convolutional Neural Networks (CNNs) are a class of deep neural networks primarily designed for processing data with grid-like topology, such as images.
Unlike traditional feedforward networks that flatten image data, CNNs preserve spatial relationships by using convolutional layers that slide filters (kernels) across input images to extract hierarchical features.

CNNs are the backbone of modern computer vision applications, including:

  • Image classification (e.g., identifying cats vs. dogs)

  • Object detection (e.g., autonomous vehicles)

  • Image segmentation (e.g., medical imaging)

  • Facial recognition and gesture detection


8.1.2 CNN Architecture Overview

A typical CNN is composed of several key layers:

  1. Convolutional Layer – Performs convolution operations to extract features.

  2. Activation Function (ReLU) – Introduces non-linearity.

  3. Pooling Layer – Reduces spatial dimensions and computational load.

  4. Fully Connected (FC) Layer – Combines extracted features for classification.

  5. Output Layer – Produces final class predictions.

Figure (conceptual):
A block diagram of CNN architecture showing flow: Input Image → Convolution → ReLU → Pooling → Flatten → Fully Connected → Output.


8.2 Convolution, Pooling, and Padding

8.2.1 Convolution Operation

A convolution is the process of sliding a filter (kernel) across an image and computing a dot product between the kernel and overlapping regions of the image.
This produces a feature map that highlights important patterns like edges, textures, and corners.

Mathematical Expression:
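\[
S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i+m, j+n) \cdot K(m, n)
\]
Where:

  • \(I\) = Input image

  • \(K\) = Kernel/filter

  • \(S\) = Output feature map

  • \(m, n\) = Row and column offsets within the kernel

Strictly speaking, this expression (with no flipping of the kernel) is a cross-correlation; it is the operation that deep learning libraries, including PyTorch, implement under the name "convolution".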

Example:
If the kernel detects horizontal edges, it will respond strongly to horizontal transitions in pixel intensity.
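
The double sum above can be checked numerically. The following is a minimal sketch, not a reference implementation: the helper name cross_correlate2d, the 5×5 test image, and the edge kernel are all illustrative choices. It computes the sum with explicit loops and compares the result with torch.nn.functional.conv2d.

```python
import torch
import torch.nn.functional as F

def cross_correlate2d(image, kernel):
    """Direct loop version of S(i, j) = sum_m sum_n I(i+m, j+n) * K(m, n)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = torch.zeros(out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the kernel and the window it currently overlaps
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

# Illustrative image: dark rows on top, bright rows at the bottom (a horizontal edge)
image = torch.zeros(5, 5)
image[3:, :] = 1.0

# Illustrative kernel that responds to dark-to-bright transitions from top to bottom
kernel = torch.tensor([[-1.0, -1.0, -1.0],
                       [ 0.0,  0.0,  0.0],
                       [ 1.0,  1.0,  1.0]])

manual = cross_correlate2d(image, kernel)
# conv2d expects (batch, channels, height, width) for both input and weight
builtin = F.conv2d(image.view(1, 1, 5, 5), kernel.view(1, 1, 3, 3)).view(3, 3)

print(manual)
print(torch.allclose(manual, builtin))
```

The top row of the output is zero because the kernel sees only dark pixels there; the rows whose windows straddle the edge respond with the value 3.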


8.2.2 Pooling Operation

Pooling reduces the dimensionality of feature maps while retaining the most important information.
Common types include:

  • Max Pooling: Takes the maximum value from each window.

  • Average Pooling: Takes the average value.

Purpose:

  • Reduces computational cost.

  • Helps control overfitting by reducing the number of activations passed to later layers.

  • Provides translation invariance.

Example:
A 2×2 max pooling on a 4×4 feature map reduces it to 2×2 by selecting the maximum value in each block.
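
The 4×4 example can be reproduced in a few lines. This is a small sketch with made-up feature-map values, comparing nn.MaxPool2d and nn.AvgPool2d on the same input.

```python
import torch
import torch.nn as nn

# Illustrative 4x4 feature map, reshaped to (batch, channels, height, width)
fmap = torch.tensor([[1.0, 3.0, 2.0, 0.0],
                     [4.0, 6.0, 1.0, 1.0],
                     [0.0, 2.0, 9.0, 5.0],
                     [1.0, 1.0, 3.0, 7.0]]).view(1, 1, 4, 4)

# Non-overlapping 2x2 windows: each 2x2 block collapses to a single value
max_pooled = nn.MaxPool2d(kernel_size=2, stride=2)(fmap)
avg_pooled = nn.AvgPool2d(kernel_size=2, stride=2)(fmap)

print(max_pooled.view(2, 2))  # maximum of each block
print(avg_pooled.view(2, 2))  # mean of each block
```

For this input, max pooling keeps 6, 2, 2, and 9, while average pooling yields 3.5, 1.0, 1.0, and 6.0.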


8.2.3 Padding

When a convolutional filter larger than 1×1 slides over an unpadded image, the output feature map is smaller than the input, because the filter cannot be centered on border pixels.
Padding involves adding extra pixels (usually zeros) around the image border to:

  • Preserve output size.

  • Prevent loss of edge information.

Types of Padding:

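  • Valid Padding (no padding): Output shrinks by (kernel size − 1) pixels in each dimension at stride 1.

  • Same Padding: Maintains the same output dimensions as the input at stride 1 (for example, one pixel of padding for a 3×3 kernel).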
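
The effect of the two padding types on output size can be seen directly. This is a minimal sketch; the 28×28 input and the choice of 8 output channels are arbitrary illustration values.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)  # one single-channel 28x28 image

valid_conv = nn.Conv2d(1, 8, kernel_size=3, padding=0)  # "valid": no padding
same_conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)   # "same" for a 3x3 kernel at stride 1

valid_shape = tuple(valid_conv(x).shape)
same_shape = tuple(same_conv(x).shape)

print(valid_shape)  # spatial size shrinks from 28 to 26
print(same_shape)   # spatial size stays at 28
```

Recent PyTorch releases also accept padding="same" and padding="valid" as strings for stride-1 convolutions, which avoids computing the padding width by hand.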


8.2.4 Stride

Stride defines the step size with which the kernel moves across the image.
A stride of 1 moves the kernel one pixel at a time, so neighboring windows overlap heavily; a stride of 2 skips every other position and roughly halves the output width and height.
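
Kernel size, padding, and stride together determine the output size along each spatial dimension: output = floor((n + 2p − k) / s) + 1, where n is the input size, k the kernel size, p the padding, and s the stride. The sketch below wraps this in a hypothetical helper, conv_output_size (assuming dilation 1), and checks it against what nn.Conv2d actually produces.

```python
import torch
import torch.nn as nn

def conv_output_size(n, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension (assumes dilation of 1)."""
    return (n + 2 * padding - kernel_size) // stride + 1

x = torch.randn(1, 1, 28, 28)
for stride in (1, 2):
    conv = nn.Conv2d(1, 4, kernel_size=3, stride=stride, padding=1)
    predicted = conv_output_size(28, kernel_size=3, stride=stride, padding=1)
    actual = conv(x).shape[-1]
    print(f"stride={stride}: predicted {predicted}, actual {actual}")
```

The same formula applies to pooling layers: a 2×2 pool with stride 2 and no padding maps 28 to 14.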


8.3 Building CNNs with PyTorch

8.3.1 PyTorch Modules Overview

PyTorch offers modules such as:

  • torch.nn.Conv2d – For convolutional layers.

  • torch.nn.ReLU – Activation function.

  • torch.nn.MaxPool2d – Pooling operation.

  • torch.nn.Linear – Fully connected layers.

  • torch.nn.CrossEntropyLoss – Common loss function for classification.

  • torch.optim.Adam – Optimizer for training.


8.3.2 Example: A Simple CNN Model

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)  # 1 input channel, 32 filters
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)  # For 10 classes (MNIST)
    
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 64 * 7 * 7)  # Flatten
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
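
The value 64 * 7 * 7 in fc1 follows from the input size. The sketch below traces tensor shapes through the same convolution and pooling layers as SimpleCNN, assuming a batch of four 28×28 single-channel images (the MNIST image size); the standalone layer objects are for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same layer configuration as SimpleCNN, created standalone for inspection
conv1 = nn.Conv2d(1, 32, 3, padding=1)
conv2 = nn.Conv2d(32, 64, 3, padding=1)
pool = nn.MaxPool2d(2, 2)

x = torch.randn(4, 1, 28, 28)       # assumed input: batch of 4 MNIST-sized images
shapes = [tuple(x.shape)]

x = pool(F.relu(conv1(x)))          # padding=1 keeps 28x28, pooling halves it to 14x14
shapes.append(tuple(x.shape))

x = pool(F.relu(conv2(x)))          # 14x14 is kept by the conv, then halved to 7x7
shapes.append(tuple(x.shape))

flat = torch.flatten(x, start_dim=1)  # keep the batch dimension, flatten the rest
shapes.append(tuple(flat.shape))

for s in shapes:
    print(s)
```

If the input resolution changes, the in_features of fc1 must change with it. Using torch.flatten(x, start_dim=1) instead of x.view(-1, 64 * 7 * 7) is one way to make a mismatch surface as a shape error at fc1 rather than as a silently altered batch size.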

8.3.3 Training the CNN

import torch.optim as optim
from torchvision import datasets, transforms

# Data loading and preprocessing
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

trainset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

# Initialize model, loss, and optimizer
model = SimpleCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(5):
    running_loss = 0.0
    for images, labels in trainloader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {running_loss/len(trainloader)}")
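
The training loop above reports only the training loss. A possible evaluation step is sketched below; the function name evaluate is hypothetical, and a small stand-in model with synthetic data is used so that the example runs without downloading MNIST.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader):
    """Return the fraction of correctly classified samples in `loader`."""
    model.eval()                      # switch layers such as dropout/batch norm to eval mode
    correct, total = 0, 0
    with torch.no_grad():             # no gradients are needed for evaluation
        for images, labels in loader:
            preds = model(images).argmax(dim=1)   # index of the highest logit per sample
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total

# Stand-in model and random data (assumptions made only to keep the sketch self-contained)
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
toy_data = TensorDataset(torch.randn(100, 1, 28, 28), torch.randint(0, 10, (100,)))
toy_loader = DataLoader(toy_data, batch_size=32)

accuracy = evaluate(toy_model, toy_loader)
print(f"Accuracy: {accuracy:.2%}")
```

With the chapter's setup, the loader would instead wrap datasets.MNIST(root='./data', train=False, download=True, transform=transform), and the trained SimpleCNN instance would be passed as the model.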

8.4 Image Classification Example (CIFAR-10 / MNIST)

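The MNIST dataset contains 70,000 grayscale images (28×28 pixels) of handwritten digits (0–9), split into 60,000 training and 10,000 test images.
The CIFAR-10 dataset includes 60,000 color images (32×32 pixels) across 10 classes such as airplanes, automobiles, and animals, split into 50,000 training and 10,000 test images.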

Example for CIFAR-10:

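import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms, models

transform = transforms.Compose([
    transforms.Resize(32),  # CIFAR-10 images are already 32x32, so this leaves them unchanged
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

# Use the ResNet-18 architecture, trained from scratch with a 10-class output layer.
# No pretrained weights are loaded here (see Section 8.5 for transfer learning).
# torchvision >= 0.13 uses weights=None; older releases used pretrained=False.
model = models.resnet18(weights=None, num_classes=10)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)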

8.5 Transfer Learning and Fine-tuning

8.5.1 What is Transfer Learning?

Transfer learning leverages a pre-trained model (trained on large datasets like ImageNet) and adapts it for a new, smaller task.
This reduces training time and improves performance, especially when labeled data is limited.


8.5.2 Fine-tuning Steps

  1. Load a Pre-trained Model

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # torchvision >= 0.13; older releases used pretrained=True
    
  2. Freeze the Pre-trained Layers

    for param in model.parameters():
        param.requires_grad = False
    
  3. Modify the Final Layer

    model.fc = nn.Linear(model.fc.in_features, 10)  # For 10 classes
    
  4. Train Only the New Layers

    optimizer = optim.Adam(model.fc.parameters(), lr=0.001)
    

This allows the model to retain general features (edges, colors, shapes) while adapting to specific features of your new dataset.


8.6 Summary

  • CNNs are designed to capture spatial hierarchies in image data using convolution, pooling, and padding.

  • Convolution layers extract features; pooling layers reduce dimensionality.

  • PyTorch provides powerful abstractions (Conv2d, MaxPool2d, etc.) for building CNNs.

  • Image classification can be implemented easily using datasets like MNIST or CIFAR-10.

  • Transfer learning enables faster and more efficient model adaptation.


8.7 Exercises

  1. Conceptual Questions
    a. Explain how convolution helps in feature extraction.
    b. What is the role of padding in CNNs?
    c. Differentiate between max pooling and average pooling.
    d. Why is ReLU preferred as an activation function in CNNs?

  2. Coding Tasks
    a. Modify the SimpleCNN to work with CIFAR-10 dataset (3 input channels).
    b. Implement a CNN using BatchNorm2d to improve stability.
    c. Visualize feature maps from the first convolutional layer.
    d. Apply transfer learning using VGG16 and fine-tune the final layers.

  3. Research/Project Idea
    Train a CNN to classify medical X-ray images (normal vs. pneumonia) using transfer learning with ResNet50.
