Chapter 22: Computer Vision Project in PyTorch


Abstract:

Developing a computer vision project in PyTorch involves a structured approach, leveraging PyTorch's capabilities and its torchvision library for tasks like image classification, object detection, or segmentation.
1. Project Conception and Data Acquisition:
  • Define the Task: 
    Clearly identify the computer vision problem you aim to solve (e.g., classifying dog breeds, detecting cars in images, segmenting medical images).
  • Data Collection/Selection: 
    Obtain a relevant dataset. This could be a pre-existing dataset (like CIFAR-10, ImageNet, COCO) or a custom dataset collected for your specific project.
2. Data Preprocessing and Loading:
  • Transformations: 
    Apply necessary image transformations using torchvision.transforms for data augmentation (e.g., resizing, cropping, normalization, random rotations/flips) to improve model generalization.
  • Dataset Creation: 
    Create a custom dataset class inheriting from torch.utils.data.Dataset if using a custom dataset, or use pre-built datasets from torchvision.datasets. This class handles loading individual images and their corresponding labels.
  • DataLoader: 
    Create torch.utils.data.DataLoader instances for training, validation, and testing. These efficiently batch and shuffle the data for training; a minimal sketch of this stage follows below.
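
A minimal sketch of this preprocessing-and-loading stage, assuming a hypothetical ImageFolder-style directory "data/train" with one sub-folder per class:

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Augmentation and normalization; the normalization constants are the
# standard ImageNet statistics.
train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "data/train" is a placeholder path.
train_dataset = datasets.ImageFolder("data/train", transform=train_transforms)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
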
3. Model Definition:
  • Architecture Selection: 
    Choose or design a suitable neural network architecture (e.g., CNNs like ResNet, VGG, MobileNet for classification; Faster R-CNN, YOLO for object detection; U-Net for segmentation).
  • Model Implementation: 
    Define your model by subclassing torch.nn.Module. Implement the __init__ method to define layers and the forward method to specify the data flow through the network.
  • Transfer Learning (Optional): 
    Consider using pre-trained models from torchvision.models and fine-tuning them on your specific dataset, especially when working with limited data; see the sketch below.
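
A minimal sketch of both options. The weights argument assumes torchvision 0.13 or newer (older releases use pretrained=True):

import torch.nn as nn
from torchvision import models

# Option 1: a small CNN defined from scratch by subclassing nn.Module.
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Option 2: transfer learning -- load a pre-trained ResNet-18 and replace
# its final fully connected layer for a 10-class problem.
model = models.resnet18(weights="DEFAULT")
model.fc = nn.Linear(model.fc.in_features, 10)
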
4. Training Configuration:
  • Loss Function:
    Select an appropriate loss function based on your task (e.g., nn.CrossEntropyLoss for classification, nn.MSELoss or nn.L1Loss for regression, and specialized losses for object detection/segmentation).
  • Optimizer:
    Choose an optimizer (e.g., torch.optim.Adam, torch.optim.SGD) to update model weights during training.
  • Learning Rate Scheduler (Optional):
    Implement a learning rate scheduler to adjust the learning rate during training, potentially improving convergence. A typical configuration is sketched below.
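
A typical configuration for classification, continuing from the model above (the learning-rate values are illustrative):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                    # multi-class classification
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Optional: halve the learning rate every 10 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
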
5. Training Loop:
  • Iteration: Iterate through epochs and batches of data from the DataLoader.
  • Forward Pass: Pass input data through the model to get predictions.
  • Loss Calculation: Compute the loss between predictions and ground truth labels.
  • Backward Pass: Perform backpropagation to calculate gradients.
  • Optimizer Step: Update model weights using the optimizer.
  • Evaluation: Periodically evaluate the model on a validation set to monitor performance and prevent overfitting. A complete loop is sketched below.
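
Putting these steps together, a minimal loop (assuming the model, loaders, criterion, optimizer, and scheduler from the previous sketches, plus a num_epochs setting):

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(num_epochs):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()              # clear gradients from the previous step
        outputs = model(images)            # forward pass
        loss = criterion(outputs, labels)  # loss against ground-truth labels
        loss.backward()                    # backward pass (backpropagation)
        optimizer.step()                   # weight update
    scheduler.step()

    # Periodic evaluation on the validation set.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    print(f"epoch {epoch}: val accuracy {correct / total:.3f}")
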
6. Evaluation and Deployment:
  • Testing: 
    Evaluate the trained model on a separate test set to assess its generalization performance on unseen data.
  • Metrics: 
    Calculate relevant metrics for your task (e.g., accuracy, precision, recall, F1-score for classification; IoU, mAP for object detection/segmentation).
  • Deployment (Optional): 
    If applicable, deploy the trained model for real-world inference. This might involve converting the model to a production-ready format or integrating it into an application.



Chapter 22: Computer Vision Project

Object Detection with YOLO / Faster R-CNN & Image Segmentation with U-Net


22.1 Introduction

Computer Vision (CV) lies at the core of modern Artificial Intelligence systems, enabling machines to perceive, understand, and interpret visual information from images and videos. Among the vast range of CV tasks, object detection and image segmentation are two of the most widely used and powerful applications.

This chapter presents a complete project pipeline for:

  1. Object Detection using:

    • YOLO (You Only Look Once) – real-time, one-stage detector

    • Faster R-CNN – two-stage, high-accuracy detector

  2. Image Segmentation using:

    • U-Net – encoder–decoder architecture for biomedical and general segmentation tasks

The chapter covers data preprocessing, model architecture, training, evaluation, and deployment guidelines.


22.2 Object Detection Project

Object detection involves locating objects (bounding boxes) and classifying them within an image.
The two most popular architectures—YOLO and Faster R-CNN—represent opposite philosophies:

  • YOLO: Fast, single-shot detector for real-time use.

  • Faster R-CNN: Accurate, two-stage detector for high-quality predictions.


22.2.1 Understanding YOLO: A One-Stage Detector

How YOLO Works

YOLO divides the input image into an S × S grid.
Each grid cell predicts:

  • Bounding box coordinates

  • Objectness probability

  • Class probabilities

These predictions are combined to detect multiple objects in a single forward pass.
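
As a concrete illustration, the original YOLO used S = 7, B = 2 boxes per cell, and C = 20 Pascal VOC classes, so the raw output per image is a 7 × 7 × 30 tensor:

# Illustrative only: shape of the raw YOLOv1 output tensor.
S, B, C = 7, 2, 20                 # grid size, boxes per cell, classes
output_shape = (S, S, B * 5 + C)   # 5 = 4 box coordinates + 1 objectness score
print(output_shape)                # (7, 7, 30)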

Key Features

  • Extremely fast (real-time capable)

  • Good accuracy for large, well-separated objects

  • Modern versions (YOLOv5, YOLOv7, YOLOv8) are modular and lightweight

YOLO Architecture Overview

  1. Backbone: Extracts features (CSPDarknet, YOLOv5 CSP modules, etc.)

  2. Neck: Combines features using FPN/PAN

  3. Head: Predicts bounding boxes, confidences, and classes

YOLO Use Cases

  • Surveillance systems

  • Autonomous drones

  • Traffic monitoring

  • Retail checkout systems

  • Industrial inspection


22.2.2 Faster R-CNN: A Two-Stage Detector

Faster R-CNN follows a more detailed process:

Stage 1 – Region Proposal Network (RPN)

Generates candidate regions (anchors) where objects may exist.

Stage 2 – Classification & Bounding Box Refinement

A CNN head (e.g., built on ResNet features) classifies each proposal and refines its bounding box.


Why Use Faster R-CNN?

  • Highly accurate

  • Robust for small objects

  • Performs well on VOC/COCO datasets

  • Suitable for medical imaging, aerial imagery, and robotics

Limitations

  • Slower than YOLO

  • Less suitable for real-time applications


22.2.3 Data Pipeline for Object Detection

Regardless of architecture, the pipeline includes:

1. Dataset Selection

Popular datasets:

  • COCO

  • Pascal VOC

  • Open Images

  • Custom datasets (via Roboflow, CVAT, LabelImg)

2. Annotation Format

Most common (a YOLO TXT example follows the list):

  • YOLO TXT format

  • VOC XML

  • COCO JSON
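
In the YOLO TXT format, each image has a matching .txt file with one line per object, and all coordinates normalized to [0, 1]:

# <class_id> <x_center> <y_center> <width> <height>
0 0.512 0.430 0.250 0.310
2 0.145 0.780 0.090 0.160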

3. Data Preprocessing

  • Resize images

  • Normalize pixel values

  • Data Augmentation:

    • Random flip

    • Color jitter

    • Mosaic augmentation (YOLO)

    • Random rotation

4. Building the Dataloader

For PyTorch:

from torch.utils.data import DataLoader

# Detection targets vary in size per image, so group them as ragged tuples.
custom_collate = lambda batch: tuple(zip(*batch))
train_loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=custom_collate)
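
The dataset behind that loader must yield the (image, target) pairs torchvision's detection models expect. A hypothetical sketch, with annotation parsing left as a placeholder:

import torch
from torch.utils.data import Dataset
from torchvision.io import read_image

class DetectionDataset(Dataset):
    def __init__(self, image_paths, annotations):
        self.image_paths = image_paths
        self.annotations = annotations   # one (boxes, labels) pair per image

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = read_image(self.image_paths[idx]).float() / 255.0
        boxes, labels = self.annotations[idx]
        target = {
            "boxes": torch.as_tensor(boxes, dtype=torch.float32),  # [N, 4] in xyxy
            "labels": torch.as_tensor(labels, dtype=torch.int64),  # [N]
        }
        return image, target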

22.2.4 YOLO Implementation Workflow

Step 1: Install YOLO Framework (e.g., YOLOv5)

git clone https://github.com/ultralytics/yolov5
cd yolov5
pip install -r requirements.txt

Step 2: Train YOLO

python train.py --img 640 --batch 16 --epochs 50 --data data.yaml --weights yolov5s.pt

Step 3: Inference on Images

python detect.py --weights best.pt --source test_images/

Step 4: Evaluate Performance

  • mAP@0.5

  • Precision/Recall

  • Inference latency


22.2.5 Faster R-CNN Implementation Workflow

Step 1: Load Pre-trained Model

import torchvision

# torchvision >= 0.13 uses the weights argument (older versions: pretrained=True).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

Step 2: Modify for Custom Classes

from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# num_classes must include the background class (number of object classes + 1).
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

Step 3: Training Loop

model.train()
for images, targets in train_loader:
    optimizer.zero_grad()                 # reset gradients from the previous batch
    loss_dict = model(images, targets)    # in train mode, returns a dict of losses
    loss = sum(loss_dict.values())
    loss.backward()
    optimizer.step()

Step 4: Evaluation Metrics

  • mAP@IoU 0.50

  • AP for each class

  • Confusion matrix of detections
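
The metrics above can be computed by hand from IoU matches, or with the third-party torchmetrics package (pip install torchmetrics), as in this sketch:

import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision(iou_thresholds=[0.5])
model.eval()
with torch.no_grad():
    for images, targets in val_loader:
        preds = model(images)        # eval mode returns boxes, scores, labels
        metric.update(preds, targets)
print(metric.compute())              # dict including map_50 and related values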


22.3 Image Segmentation Project Using U-Net

Segmentation assigns a class label to every pixel, resulting in a mask that identifies object boundaries precisely.

U-Net is the most popular architecture, originally built for biomedical segmentation.


22.3.1 U-Net Architecture Overview

U-Net has two paths:

1. Encoder Path (Contracting)

  • Convolution layers

  • ReLU activation

  • Max pooling

  • Captures contextual information

2. Decoder Path (Expanding)

  • Up-convolution (transposed convolution)

  • Skip connections (merging encoder features)

  • Restores spatial resolution

3. Skip Connections

Enable the decoder to recover fine-grained details, preventing the loss of spatial information.


22.3.2 Applications of U-Net

  • Medical image segmentation (tumors, organs)

  • Road and lane segmentation (autonomous driving)

  • Satellite image analysis

  • Agriculture (crop segmentation)

  • Forestry and environment monitoring


22.3.3 Dataset Pipeline for Segmentation

Input Data

  • RGB image → H × W × 3

  • Mask → H × W × C one-hot (C = number of classes), or H × W with integer class indices

Augmentation Techniques

  • Random crop

  • Horizontal/vertical flip

  • Elastic deformation

  • CLAHE for medical images

Preprocessing

  • Normalize images

  • One-hot encode masks

  • Resize to fixed input (e.g., 256×256)
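
A minimal sketch of these preprocessing steps for one image/mask pair; the paths are placeholders, and the mask file is assumed to store integer class indices per pixel:

import numpy as np
import torch
import torchvision.transforms.functional as F
from PIL import Image

def load_pair(image_path, mask_path, size=(256, 256)):
    image = Image.open(image_path).convert("RGB").resize(size)
    # Nearest-neighbour interpolation keeps mask labels discrete.
    mask = Image.open(mask_path).resize(size, Image.NEAREST)

    image = F.to_tensor(image)                                 # [3, H, W] in [0, 1]
    image = F.normalize(image, mean=[0.5] * 3, std=[0.5] * 3)
    mask = torch.as_tensor(np.array(mask), dtype=torch.long)   # [H, W] class indices
    return image, mask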


22.3.4 PyTorch Implementation of U-Net

Simplified U-Net Model

import torch
import torch.nn as nn

class UNet(nn.Module):
    def __init__(self, in_channels=3, out_channels=1):
        super().__init__()

        # Two 3x3 convolutions with ReLU, preserving spatial size.
        def conv_block(in_c, out_c):
            return nn.Sequential(
                nn.Conv2d(in_c, out_c, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_c, out_c, 3, padding=1),
                nn.ReLU(inplace=True)
            )

        self.down1 = conv_block(in_channels, 64)
        self.pool = nn.MaxPool2d(2)
        self.down2 = conv_block(64, 128)

        self.bridge = conv_block(128, 256)

        self.up1 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.up_block1 = conv_block(256, 128)

        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.up_block2 = conv_block(128, 64)

        self.final = nn.Conv2d(64, out_channels, 1)

    def forward(self, x):
        d1 = self.down1(x)
        d2 = self.down2(self.pool(d1))
        b = self.bridge(self.pool(d2))

        u1 = self.up1(b)
        u1 = self.up_block1(torch.cat([u1, d2], dim=1))

        u2 = self.up2(u1)
        u2 = self.up_block2(torch.cat([u2, d1], dim=1))

        return self.final(u2)
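
A quick shape check on the model above (the input size must be divisible by 4, since there are two pooling stages):

model = UNet(in_channels=3, out_channels=1)
x = torch.randn(1, 3, 256, 256)   # one 256x256 RGB image
print(model(x).shape)             # torch.Size([1, 1, 256, 256])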

22.3.5 Training Loop

# CrossEntropyLoss expects out_channels = num_classes and masks of shape
# [B, H, W] with integer class indices; for the binary (out_channels=1)
# model above, use nn.BCEWithLogitsLoss with float masks instead.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
for images, masks in train_loader:
    optimizer.zero_grad()
    preds = model(images)
    loss = criterion(preds, masks)
    loss.backward()
    optimizer.step()

22.3.6 Evaluation Metrics for Segmentation

  • IoU (Intersection over Union)

  • Dice Coefficient

  • Pixel Accuracy

  • Precision/Recall for masks

Dice Score

\[
\text{Dice} = \frac{2\,TP}{2\,TP + FP + FN}
\]
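
A sketch of Dice and IoU for binary masks (tensors of 0s and 1s); the epsilon guards against empty masks:

import torch

def dice_score(pred, target, eps=1e-7):
    pred, target = pred.float(), target.float()
    intersection = (pred * target).sum()          # = TP
    return (2 * intersection + eps) / (pred.sum() + target.sum() + eps)

def iou_score(pred, target, eps=1e-7):
    pred, target = pred.float(), target.float()
    intersection = (pred * target).sum()
    union = pred.sum() + target.sum() - intersection
    return (intersection + eps) / (union + eps)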


22.3.7 Visualization of Results

Overlay the predicted mask on the original image:

overlay = 0.6 * image + 0.4 * mask   # simple alpha blend of image and mask

Use tools:

  • Matplotlib

  • OpenCV (cv2.addWeighted)

  • TensorBoard
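
For example, with OpenCV (assuming image and a color-mapped mask_colored are uint8 arrays of the same shape):

import cv2

# Weighted blend: 0.6 * image + 0.4 * mask_colored.
overlay = cv2.addWeighted(image, 0.6, mask_colored, 0.4, 0)
cv2.imwrite("overlay.png", overlay)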


22.4 Deployment Options

1. Web Deployment

  • FastAPI with endpoints for image upload

  • Serve YOLO/UNet models using PyTorch or ONNX Runtime

2. Mobile Deployment

  • Export to ONNX → convert to CoreML / TFLite

  • Suitable for edge devices

3. Cloud Deployment

  • AWS Lambda

  • GPU-enabled servers

  • Dockerized services
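
Both the web and mobile paths above typically start from an ONNX export. A minimal sketch for the U-Net defined earlier (the filename, input size, and opset version are illustrative):

import torch

model.eval()
dummy = torch.randn(1, 3, 256, 256)   # example input for tracing
torch.onnx.export(model, dummy, "unet.onnx",
                  input_names=["image"], output_names=["mask"],
                  opset_version=17)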


22.5 Challenges and Best Practices

Common Challenges

  • Annotating large datasets

  • Handling imbalanced classes

  • Small object detection

  • Maintaining consistent masks for segmentation

Best Practices

  • Use pre-trained models to reduce training time

  • Apply advanced augmentations (Mosaic, MixUp, Elastic)

  • Use early stopping

  • Monitor TensorBoard for overfitting

  • Validate model with real-world test data


22.6 Conclusion

In this chapter, we explored the complete workflow of a computer vision project covering:

  • Object detection using YOLO and Faster R-CNN

  • Image segmentation using U-Net

  • Data pipelines, architecture understanding, PyTorch implementations

  • Evaluation methods and deployment strategies

These skills are foundational for creating real-world AI applications in healthcare, security, transportation, agriculture, manufacturing, and more. The next chapters will help extend these foundational concepts to more advanced AI topics and larger project ecosystems.
