Chapter 8: Convolutional Neural Networks (CNNs) in PyTorch
Abstract:
PyTorch provides the torch.nn module to easily define, build, and train CNNs. Its main building blocks are:

- Convolutional Layers (nn.Conv2d): These layers apply a set of learnable filters (kernels) to the input image, extracting features such as edges, textures, or more complex patterns. Key parameters include in_channels, out_channels, kernel_size, stride, and padding.
- Activation Functions: Non-linear activation functions, commonly ReLU (nn.ReLU), are applied after convolutional layers to introduce non-linearity, enabling the network to learn more complex relationships.
- Pooling Layers (nn.MaxPool2d, nn.AvgPool2d): These layers reduce the spatial dimensions (width and height) of the feature maps, thereby reducing the number of parameters and computational cost, and providing a degree of translational invariance. Max pooling is a common choice.
- Fully Connected Layers (nn.Linear): After several convolutional and pooling layers, the flattened feature maps are typically passed through one or more fully connected layers for classification or regression.

A CNN is defined as a class that subclasses torch.nn.Module and contains:

- An __init__ method, which defines the layers of the network (e.g., nn.Conv2d, nn.MaxPool2d, nn.Linear).
- A forward method, which defines the forward pass, specifying how the input data flows through the defined layers and activation functions to produce an output.

For example:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # 3 input channels (RGB), 16 output channels
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(32 * 8 * 8, 128)  # e.g., a 32x32 input becomes 8x8 after two 2x2 poolings
        self.fc2 = nn.Linear(128, 10)          # 10 output classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 32 * 8 * 8)  # Flatten the feature maps
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

Training such a model involves:

- Data Preparation: Loading and transforming image datasets using torchvision.datasets and torchvision.transforms, and creating data loaders with torch.utils.data.DataLoader.
- Loss Function: Defining a suitable loss function (e.g., nn.CrossEntropyLoss for classification).
- Optimizer: Choosing an optimizer (e.g., torch.optim.Adam, torch.optim.SGD).
- Training Loop: Iterating through epochs, performing forward and backward passes, and updating model parameters based on the calculated loss and the optimizer.
- Evaluation: Assessing the model's performance on a test set.
Chapter 8: Convolutional Neural Networks (CNNs)
Learning Objectives
After completing this chapter, readers will be able to:
- Understand the core architecture and principles behind Convolutional Neural Networks (CNNs).
- Explain the roles of convolution, pooling, and padding operations.
- Build and train CNN models using PyTorch.
- Implement image classification tasks using popular datasets such as MNIST or CIFAR-10.
- Apply transfer learning and fine-tuning for efficient model reuse and improvement.
8.1 Fundamentals of CNNs
8.1.1 Introduction
Convolutional Neural Networks (CNNs) are a class of deep neural networks primarily designed for processing data with grid-like topology, such as images.
Unlike traditional feedforward networks that flatten image data, CNNs preserve spatial relationships by using convolutional layers that slide filters (kernels) across input images to extract hierarchical features.
CNNs are the backbone of modern computer vision applications, including:
- Image classification (e.g., identifying cats vs. dogs)
- Object detection (e.g., autonomous vehicles)
- Image segmentation (e.g., medical imaging)
- Facial recognition and gesture detection
8.1.2 CNN Architecture Overview
A typical CNN is composed of several key layers:
- Convolutional Layer – Performs convolution operations to extract features.
- Activation Function (ReLU) – Introduces non-linearity.
- Pooling Layer – Reduces spatial dimensions and computational load.
- Fully Connected (FC) Layer – Combines extracted features for classification.
- Output Layer – Produces final class predictions.
Figure (conceptual):
A block diagram of CNN architecture showing flow: Input Image → Convolution → ReLU → Pooling → Flatten → Fully Connected → Output.
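As a preview of Section 8.3, this flow can be expressed almost literally in PyTorch. The sketch below is illustrative only: it assumes a 1-channel 28×28 input and arbitrary filter counts, none of which come from the figure itself.

import torch.nn as nn

# Block diagram as code: Convolution -> ReLU -> Pooling -> Flatten -> Fully Connected -> Output scores
cnn_pipeline = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # convolution (assumed: 1 input channel, 8 filters)
    nn.ReLU(),                                  # non-linearity
    nn.MaxPool2d(2),                            # 28x28 -> 14x14
    nn.Flatten(),                               # flatten feature maps to a vector
    nn.Linear(8 * 14 * 14, 10),                 # fully connected layer producing 10 class scores
)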
8.2 Convolution, Pooling, and Padding
8.2.1 Convolution Operation
A convolution is the process of sliding a filter (kernel) across an image and computing a dot product between the kernel and overlapping regions of the image.
This produces a feature map that highlights important patterns like edges, textures, and corners.
Mathematical Expression:
\[
S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i+m, j+n) \cdot K(m, n)
\]
Where:
- I = the input image
- K = the kernel (filter)
- S = the output feature map
Example:
If the kernel detects horizontal edges, it will respond strongly to horizontal transitions in pixel intensity.
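As a quick, hand-checkable sketch (the image and kernel values below are invented for illustration), torch.nn.functional.conv2d can be used to see this effect directly:

import torch
import torch.nn.functional as F

# A tiny 4x4 "image": dark on top, bright on the bottom (a horizontal edge in the middle).
image = torch.tensor([[0., 0., 0., 0.],
                      [0., 0., 0., 0.],
                      [1., 1., 1., 1.],
                      [1., 1., 1., 1.]]).reshape(1, 1, 4, 4)   # (batch, channels, H, W)

# A horizontal-edge kernel: responds strongly where intensity changes from top to bottom.
kernel = torch.tensor([[-1., -1., -1.],
                       [ 0.,  0.,  0.],
                       [ 1.,  1.,  1.]]).reshape(1, 1, 3, 3)

feature_map = F.conv2d(image, kernel, padding=1)
print(feature_map.squeeze())   # strong positive responses in the rows straddling the dark-to-bright transition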
8.2.2 Pooling Operation
Pooling reduces the dimensionality of feature maps while retaining the most important information.
Common types include:
- Max Pooling: Takes the maximum value from each window.
- Average Pooling: Takes the average value of each window.
Purpose:
- Reduces computational cost.
- Controls overfitting.
- Provides translation invariance.
Example:
A 2×2 max pooling on a 4×4 feature map reduces it to 2×2 by selecting the maximum value in each block.
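The same example in code (the feature-map values are chosen arbitrarily for illustration):

import torch
import torch.nn as nn

fmap = torch.tensor([[1., 3., 2., 0.],
                     [4., 6., 1., 2.],
                     [7., 2., 9., 4.],
                     [1., 5., 3., 8.]]).reshape(1, 1, 4, 4)   # a 4x4 feature map

pooled = nn.MaxPool2d(kernel_size=2, stride=2)(fmap)
print(pooled.squeeze())   # tensor([[6., 2.], [7., 9.]]) -- the maximum of each 2x2 block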
8.2.3 Padding
When a convolutional filter slides over an image, the feature map size decreases.
Padding involves adding extra pixels (usually zeros) around the image border to:
- Preserve the output size.
- Prevent loss of edge information.
Types of Padding:
- Valid Padding (no padding): The output shrinks.
- Same Padding: Maintains the same output dimensions as the input.
8.2.4 Stride
Stride defines the step size with which the kernel moves across the image.
A stride of 1 results in overlapping windows, while a stride of 2 reduces the output size.
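Kernel size, padding, and stride jointly determine the output size. The standard relation (not stated explicitly above, but worth keeping at hand) is:

\[
O = \left\lfloor \frac{N + 2P - K}{S} \right\rfloor + 1
\]

where N is the input size, K the kernel size, P the padding, and S the stride. For example, a 28×28 input with K = 3, P = 1, S = 1 gives O = (28 + 2 - 3)/1 + 1 = 28 (the "same" padding case), while the same layer with S = 2 gives O = ⌊27/2⌋ + 1 = 14.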
8.3 Building CNNs with PyTorch
8.3.1 PyTorch Modules Overview
PyTorch offers modules such as:
- torch.nn.Conv2d – For convolutional layers.
- torch.nn.ReLU – Activation function.
- torch.nn.MaxPool2d – Pooling operation.
- torch.nn.Linear – Fully connected layers.
- torch.nn.CrossEntropyLoss – Common loss function for classification.
- torch.optim.Adam – Optimizer for training.
8.3.2 Example: A Simple CNN Model
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)   # 1 input channel (grayscale), 32 filters
        self.pool = nn.MaxPool2d(2, 2)                # 2x2 max pooling with stride 2
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)         # 28x28 -> 14x14 -> 7x7 after two pooling steps
        self.fc2 = nn.Linear(128, 10)                 # For 10 classes (MNIST)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 64 * 7 * 7)                    # Flatten
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
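A quick sanity check, not part of the original listing, is to push a dummy MNIST-sized batch through the model and confirm the output shape:

model = SimpleCNN()
dummy = torch.randn(4, 1, 28, 28)   # a fake batch of four 1x28x28 images
print(model(dummy).shape)           # torch.Size([4, 10]) -- one score per class per image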
8.3.3 Training the CNN
import torch.optim as optim
from torchvision import datasets, transforms
# Data loading and preprocessing
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
trainset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
# Initialize model, loss, and optimizer
model = SimpleCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop
for epoch in range(5):
    running_loss = 0.0
    for images, labels in trainloader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {running_loss/len(trainloader)}")
8.4 Image Classification Example (CIFAR-10 / MNIST)
The MNIST dataset contains grayscale images of handwritten digits (0–9).
The CIFAR-10 dataset includes 60,000 color images across 10 classes such as airplanes, cars, and animals.
Example for CIFAR-10:
from torchvision import datasets, transforms, models
transform = transforms.Compose([
    transforms.Resize(32),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
# ResNet-18 trained from scratch (set pretrained=True to start from ImageNet weights instead)
model = models.resnet18(pretrained=False, num_classes=10)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
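The snippet stops after defining the loss and optimizer; training proceeds exactly as in Section 8.3.3. A condensed, single-epoch sketch for this CIFAR-10 setup (illustrative only) is shown below.

# One training epoch for the CIFAR-10 setup above (reuses model, criterion, optimizer, trainloader)
model.train()
for images, labels in trainloader:
    optimizer.zero_grad()
    outputs = model(images)            # ResNet-18 accepts the 3-channel 32x32 CIFAR-10 images
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()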
8.5 Transfer Learning and Fine-tuning
8.5.1 What is Transfer Learning?
Transfer learning leverages a pre-trained model (trained on large datasets like ImageNet) and adapts it for a new, smaller task.
This reduces training time and improves performance, especially when labeled data is limited.
8.5.2 Fine-tuning Steps
- Load a Pre-trained Model:
  model = models.resnet18(pretrained=True)
- Freeze the Initial Layers:
  for param in model.parameters():
      param.requires_grad = False
- Modify the Final Layer:
  model.fc = nn.Linear(model.fc.in_features, 10)  # For 10 classes
- Train Only the New Layers:
  optimizer = optim.Adam(model.fc.parameters(), lr=0.001)
This allows the model to retain general features (edges, colors, shapes) while adapting to specific features of your new dataset.
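Putting the four steps together, a minimal end-to-end sketch might look like the following; it is illustrative rather than a tuned recipe.

import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.resnet18(pretrained=True)                   # Step 1: load ImageNet weights
for param in model.parameters():                           # Step 2: freeze the pretrained backbone
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)             # Step 3: new 10-class head (trainable by default)
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)    # Step 4: optimize only the new head
criterion = nn.CrossEntropyLoss()
# Training then proceeds with the same loop shown in Sections 8.3.3 and 8.4.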
8.6 Summary
- CNNs are designed to capture spatial hierarchies in image data using convolution, pooling, and padding.
- Convolution layers extract features; pooling layers reduce dimensionality.
- PyTorch provides powerful abstractions (Conv2d, MaxPool2d, etc.) for building CNNs.
- Image classification can be implemented easily using datasets like MNIST or CIFAR-10.
- Transfer learning enables faster and more efficient model adaptation.
8.7 Exercises
- Conceptual Questions
  a. Explain how convolution helps in feature extraction.
  b. What is the role of padding in CNNs?
  c. Differentiate between max pooling and average pooling.
  d. Why is ReLU preferred as an activation function in CNNs?
- Coding Tasks
  a. Modify the SimpleCNN to work with the CIFAR-10 dataset (3 input channels).
  b. Implement a CNN using BatchNorm2d to improve training stability.
  c. Visualize the feature maps from the first convolutional layer.
  d. Apply transfer learning using VGG16 and fine-tune the final layers.
- Research/Project Idea
  Train a CNN to classify medical X-ray images (normal vs. pneumonia) using transfer learning with ResNet50.