Chapter 16: Model Evaluation, Saving, and Loading with PyTorch

Abstract:

1. Model Evaluation:
To evaluate a PyTorch model, follow these steps:
  • Set the model to evaluation mode: 
    Use model.eval() to disable dropout and batch normalization updates, ensuring consistent behavior during inference.
  • Disable gradient calculations: 
    Wrap your evaluation loop with torch.no_grad() to prevent unnecessary gradient computations, saving memory and speeding up the process.
  • Iterate through the test or validation dataset: 
    Feed input data to the model and obtain predictions.
  • Calculate relevant metrics: 
    Compare predictions with ground truth labels to compute metrics like accuracy, precision, recall, F1-score, or loss.
Python
import torch
import torch.nn as nn

# Assuming 'model', 'test_loader', and 'criterion' are defined
model.eval()  # Set model to evaluation mode

total_loss = 0
correct_predictions = 0
total_samples = 0

with torch.no_grad():  # Disable gradient calculations
    for inputs, labels in test_loader:
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        total_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        total_samples += labels.size(0)
        correct_predictions += (predicted == labels).sum().item()

average_loss = total_loss / len(test_loader)
accuracy = correct_predictions / total_samples
print(f"Test Loss: {average_loss:.4f}, Test Accuracy: {accuracy:.4f}")
2. Saving Models:
The recommended method to save a PyTorch model is to save its state_dict, which contains the learned parameters (weights and biases).
Python
# Assuming 'model' is your trained PyTorch model
PATH = "model_weights.pt"
torch.save(model.state_dict(), PATH)
3. Loading Models:
To load a saved model, you must first instantiate the model class with the same architecture as the saved model, then load the state_dict into it.
Python
# Assuming 'MyModelClass' is the class definition of your model
model = MyModelClass(*args, **kwargs)  # Instantiate the model with the same architecture
PATH = "model_weights.pt"
model.load_state_dict(torch.load(PATH))
model.eval()  # Set to evaluation mode for inference
Saving and Loading for Continued Training:
If you intend to resume training, also save the optimizer's state_dict alongside the model's state_dict, together with other relevant information such as the epoch number and the most recent loss.
Python
# Saving a checkpoint
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pt')

# Loading a checkpoint
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
model.train()  # Set to training mode for continued training



Chapter 16: Model Evaluation, Saving, and Loading

Learning Objectives

After completing this chapter, you will be able to:

  • Understand the importance of model evaluation and persistence in deep learning workflows.

  • Apply techniques for checkpointing, saving, and loading PyTorch models.

  • Evaluate model performance using common metrics like accuracy, precision, recall, and F1-score.

  • Analyze model results through confusion matrices and ROC (Receiver Operating Characteristic) curves.

  • Implement real-world evaluation pipelines using PyTorch and scikit-learn.


16.1 Checkpointing and Model Persistence

Model training can take hours or even days. To avoid losing progress, PyTorch provides convenient tools to save and load models and training checkpoints.

Saving and Loading Models

There are two primary methods for saving models in PyTorch:

  1. Saving the entire model

  2. Saving only the model state dictionary

1. Saving and Loading the Entire Model

import torch

# Save the entire model
torch.save(model, 'model.pth')

# Load the entire model
model = torch.load('model.pth')
model.eval()

⚠️ Note: Saving the full model stores the architecture together with the weights, but the file is pickle-based and tied to the exact class definitions and module paths used at save time, so it is less portable across code versions and environments.

2. Saving and Loading State Dictionaries (Preferred Way)

# Save only the model parameters (recommended)
torch.save(model.state_dict(), 'model_state.pth')

# Load model weights
model = MyModel()  # must define the same model class
model.load_state_dict(torch.load('model_state.pth'))
model.eval()

This method is more flexible: only the parameter tensors are stored, so the file does not depend on the exact class path or pickle layout. You do, however, still need code that builds a model with the same architecture before loading the weights.
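One common environment difference is the device: weights saved on a GPU machine often need to be loaded on a CPU-only one. A minimal sketch using torch.load's map_location argument, reusing the MyModel class and file name from the snippet above:

import torch

# Map tensors saved on a GPU onto the CPU at load time
state_dict = torch.load('model_state.pth', map_location=torch.device('cpu'))

model = MyModel()  # same architecture as when the weights were saved
model.load_state_dict(state_dict)
model.eval()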


Checkpointing During Training

Checkpointing is useful for long training runs because it allows training to resume after an interruption.

# Save checkpoint
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss
}
torch.save(checkpoint, 'checkpoint.pth')

# Load checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

This approach saves both model parameters and optimizer states for exact continuation of training.
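A minimal sketch of resuming from such a checkpoint follows. It assumes train_loader, criterion, and num_epochs are defined as in earlier chapters, and that the saved epoch was the last one completed:

# Resume training from the saved checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1  # continue with the next epoch

model.train()  # back to training mode
for epoch in range(start_epoch, num_epochs):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()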


Best Practices for Model Persistence

  • Always save state_dict() instead of the full model.

  • Keep training logs and epoch counts with checkpoints.

  • Use versioning (e.g., model_v1.pth, model_v2.pth) for clarity.

  • Store checkpoints in cloud storage or remote servers for safety.
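A minimal sketch that combines several of these practices; the save_checkpoint helper, directory name, and file-naming scheme are choices made for this example, not a PyTorch convention:

import os
import torch

def save_checkpoint(model, optimizer, epoch, loss, directory='checkpoints'):
    """Save a versioned checkpoint with weights, optimizer state, and training metadata."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f'model_epoch_{epoch:03d}.pth')
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, path)
    return path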


16.2 Performance Evaluation

Evaluating the model on held-out data tells you how well it generalizes to unseen examples.

Common Evaluation Metrics

Metric                  Description                                        Formula
Accuracy                Fraction of all predictions that are correct      (TP + TN) / (TP + TN + FP + FN)
Precision               Fraction of predicted positives that are correct  TP / (TP + FP)
Recall (Sensitivity)    Fraction of actual positives that are found       TP / (TP + FN)
F1-Score                Harmonic mean of precision and recall             2 × (Precision × Recall) / (Precision + Recall)

Where:

  • TP: True Positives

  • TN: True Negatives

  • FP: False Positives

  • FN: False Negatives

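As a quick sanity check of these formulas, here is a small hand computation in Python; the confusion counts are made up purely for illustration:

# Illustrative confusion counts (not from a real model)
TP, TN, FP, FN = 40, 45, 10, 5

accuracy  = (TP + TN) / (TP + TN + FP + FN)                # 85 / 100 = 0.85
precision = TP / (TP + FP)                                 # 40 / 50  = 0.80
recall    = TP / (TP + FN)                                 # 40 / 45  ≈ 0.889
f1        = 2 * precision * recall / (precision + recall)  # ≈ 0.842

print(accuracy, precision, recall, f1)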

Implementing Evaluation in PyTorch

import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(model, dataloader):
    model.eval()
    all_preds, all_labels = [], []

    with torch.no_grad():
        for inputs, labels in dataloader:
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    acc = accuracy_score(all_labels, all_preds)
    prec = precision_score(all_labels, all_preds, average='weighted')
    rec = recall_score(all_labels, all_preds, average='weighted')
    f1 = f1_score(all_labels, all_preds, average='weighted')

    print(f"Accuracy: {acc:.4f}")
    print(f"Precision: {prec:.4f}")
    print(f"Recall: {rec:.4f}")
    print(f"F1 Score: {f1:.4f}")

This evaluation loop computes all major metrics over the test dataset.
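Usage is a single call once a test loader exists; here test_dataset is assumed to be built as in earlier chapters:

from torch.utils.data import DataLoader

test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
evaluate(model, test_loader)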


16.3 Confusion Matrix and ROC Analysis

Understanding how a model makes mistakes is as important as knowing how often it does.

Confusion Matrix

A confusion matrix shows how many predictions fall into each category — helping identify misclassifications.

Predicted \ Actual      Positive                  Negative
Positive                True Positive (TP)        False Positive (FP)
Negative                False Negative (FN)       True Negative (TN)

Example: Plotting a Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

y_true = [0, 1, 0, 1, 0, 1]
y_pred = [0, 0, 0, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.show()

This visualization provides an immediate sense of which classes are being confused with one another.


ROC (Receiver Operating Characteristic) Curve

The ROC curve illustrates the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR) at different thresholds.

  • TPR (Recall) = TP / (TP + FN)

  • FPR = FP / (FP + TN)

The Area Under the ROC Curve (AUC) measures the model’s ability to distinguish between classes:

  • AUC = 1.0: Perfect classifier

  • AUC = 0.5: Random guessing


Example: ROC Curve and AUC Score

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Example binary classification
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()

This plot helps evaluate how well the classifier separates positive and negative classes at various thresholds.


Interpreting ROC and AUC

AUC Range     Interpretation
0.9 – 1.0     Excellent
0.8 – 0.9     Good
0.7 – 0.8     Fair
0.6 – 0.7     Poor
0.5 – 0.6     Fail

16.4 Practical Example: Complete Evaluation Pipeline

import torch
from torch.utils.data import DataLoader
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Assume test_dataset and trained model exist
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

model.eval()
all_preds, all_labels, all_probs = [], [], []

with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
        probs = torch.softmax(outputs, dim=1)
        _, preds = torch.max(probs, 1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())
        all_probs.extend(probs[:, 1].cpu().numpy())  # For binary classification

# Evaluation metrics
print(classification_report(all_labels, all_preds))
print("ROC-AUC Score:", roc_auc_score(all_labels, all_probs))

This pipeline computes detailed metrics and ROC-AUC for a classification model.


16.5 Summary

  • Checkpointing ensures training progress is preserved in case of interruptions.

  • Model persistence enables reproducibility and deployment.

  • Performance evaluation using metrics like accuracy, precision, recall, and F1-score provides insight into model quality.

  • Confusion matrices and ROC curves give a deeper understanding of model behavior beyond simple accuracy.

  • Saving and loading models efficiently helps in resuming training, comparing experiments, and deploying models into production.


Exercises

  1. Conceptual Questions
    a. Explain the difference between saving the model and saving the state dictionary.
    b. Why is checkpointing important during model training?
    c. What does the F1-score represent, and why is it useful?
    d. What does the ROC curve indicate about a model’s performance?

  2. Practical Tasks
    a. Train a small neural network on MNIST or CIFAR-10. Implement periodic checkpointing every 2 epochs.
    b. Write a function to compute and plot a confusion matrix for your trained model.
    c. Compute ROC-AUC for a binary classification task using scikit-learn.
    d. Save the best model (based on validation accuracy) and load it later for testing.

  3. Advanced Challenge
    Implement a callback mechanism that automatically saves the model whenever the validation loss improves — similar to Keras’ ModelCheckpoint.
