Chapter 16: Model Evaluation, Saving, and Loading with PyTorch

Abstract:

1. Model Evaluation:
To evaluate a PyTorch model, follow these steps:
  • Set the model to evaluation mode: 
    Use model.eval() to disable dropout and batch normalization updates, ensuring consistent behavior during inference.
  • Disable gradient calculations: 
    Wrap your evaluation loop with torch.no_grad() to prevent unnecessary gradient computations, saving memory and speeding up the process.
  • Iterate through the test or validation dataset: 
    Feed input data to the model and obtain predictions.
  • Calculate relevant metrics: 
    Compare predictions with ground truth labels to compute metrics like accuracy, precision, recall, F1-score, or loss.
Python
import torch
import torch.nn as nn

# Assuming 'model', 'test_loader', and 'criterion' are defined
model.eval()  # Set model to evaluation mode

total_loss = 0
correct_predictions = 0
total_samples = 0

with torch.no_grad():  # Disable gradient calculations
    for inputs, labels in test_loader:
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        total_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        total_samples += labels.size(0)
        correct_predictions += (predicted == labels).sum().item()

average_loss = total_loss / len(test_loader)
accuracy = correct_predictions / total_samples
print(f"Test Loss: {average_loss:.4f}, Test Accuracy: {accuracy:.4f}")
2. Saving Models:
The recommended method to save a PyTorch model is to save its state_dict, which contains the learned parameters (weights and biases).
Python
# Assuming 'model' is your trained PyTorch model
PATH = "model_weights.pt"
torch.save(model.state_dict(), PATH)
3. Loading Models:
To load a saved model, you must first instantiate the model class with the same architecture as the saved model, then load the state_dict into it.
Python
# Assuming 'MyModelClass' is the class definition of your model
model = MyModelClass(*args, **kwargs)  # Instantiate the model with the same architecture
PATH = "model_weights.pt"
model.load_state_dict(torch.load(PATH))
model.eval()  # Set to evaluation mode for inference
Saving and Loading for Continued Training:
If you intend to resume training, also save the optimizer's state_dict alongside the model's state_dict, together with other relevant information such as the epoch number and the most recent loss.
Python
# Saving a checkpoint
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pt')

# Loading a checkpoint
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
model.train()  # Set to training mode for continued training



Chapter 16: Model Evaluation, Saving, and Loading

Learning Objectives

After completing this chapter, you will be able to:

  • Understand the importance of model evaluation and persistence in deep learning workflows.

  • Apply techniques for checkpointing, saving, and loading PyTorch models.

  • Evaluate model performance using common metrics like accuracy, precision, recall, and F1-score.

  • Analyze model results through confusion matrices and ROC (Receiver Operating Characteristic) curves.

  • Implement real-world evaluation pipelines using PyTorch and scikit-learn.


16.1 Checkpointing and Model Persistence

Model training can take hours or even days. To avoid losing progress, PyTorch provides convenient tools to save and load models and training checkpoints.

Saving and Loading Models

There are two primary methods for saving models in PyTorch:

  1. Saving the entire model

  2. Saving only the model state dictionary

1. Saving and Loading the Entire Model

import torch

# Save the entire model
torch.save(model, 'model.pth')

# Load the entire model
model = torch.load('model.pth')
model.eval()

⚠️ Note: Saving the full model stores the architecture together with the weights, but the file is pickle-based and tied to the exact class definitions and module paths used at save time, so it is less portable across code versions and environments.

2. Saving and Loading State Dictionaries (Preferred Way)

# Save only the model parameters (recommended)
torch.save(model.state_dict(), 'model_state.pth')

# Load model weights
model = MyModel()  # must define the same model class
model.load_state_dict(torch.load('model_state.pth'))
model.eval()

This method is more flexible: only the parameter tensors are stored, so the file does not depend on the exact class path or pickle layout. You do, however, still need code that builds a model with the same architecture before loading the weights.
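One common environment difference is the device: weights saved on a GPU machine often need to be loaded on a CPU-only one. A minimal sketch using torch.load's map_location argument, reusing the MyModel class and file name from the snippet above:

import torch

# Map tensors saved on a GPU onto the CPU at load time
state_dict = torch.load('model_state.pth', map_location=torch.device('cpu'))

model = MyModel()  # same architecture as when the weights were saved
model.load_state_dict(state_dict)
model.eval()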


Checkpointing During Training

Checkpointing is useful for long training runs because it allows training to resume after an interruption.

# Save checkpoint
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss
}
torch.save(checkpoint, 'checkpoint.pth')

# Load checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

This approach saves both model parameters and optimizer states for exact continuation of training.
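A minimal sketch of resuming from such a checkpoint follows. It assumes train_loader, criterion, and num_epochs are defined as in earlier chapters, and that the saved epoch was the last one completed:

# Resume training from the saved checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1  # continue with the next epoch

model.train()  # back to training mode
for epoch in range(start_epoch, num_epochs):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()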


Best Practices for Model Persistence

  • Always save state_dict() instead of the full model.

  • Keep training logs and epoch counts with checkpoints.

  • Use versioning (e.g., model_v1.pth, model_v2.pth) for clarity.

  • Store checkpoints in cloud storage or remote servers for safety.
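A minimal sketch that combines several of these practices; the save_checkpoint helper, directory name, and file-naming scheme are choices made for this example, not a PyTorch convention:

import os
import torch

def save_checkpoint(model, optimizer, epoch, loss, directory='checkpoints'):
    """Save a versioned checkpoint with weights, optimizer state, and training metadata."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f'model_epoch_{epoch:03d}.pth')
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, path)
    return path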


16.2 Performance Evaluation

Evaluating the model on held-out data tells you how well it generalizes to unseen examples.

Common Evaluation Metrics

Metric                  Description                                        Formula
Accuracy                Fraction of all predictions that are correct      (TP + TN) / (TP + TN + FP + FN)
Precision               Fraction of predicted positives that are correct  TP / (TP + FP)
Recall (Sensitivity)    Fraction of actual positives that are found       TP / (TP + FN)
F1-Score                Harmonic mean of precision and recall             2 × (Precision × Recall) / (Precision + Recall)

Where:

  • TP: True Positives

  • TN: True Negatives

  • FP: False Positives

  • FN: False Negatives

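As a quick sanity check of these formulas, here is a small hand computation in Python; the confusion counts are made up purely for illustration:

# Illustrative confusion counts (not from a real model)
TP, TN, FP, FN = 40, 45, 10, 5

accuracy  = (TP + TN) / (TP + TN + FP + FN)                # 85 / 100 = 0.85
precision = TP / (TP + FP)                                 # 40 / 50  = 0.80
recall    = TP / (TP + FN)                                 # 40 / 45  ≈ 0.889
f1        = 2 * precision * recall / (precision + recall)  # ≈ 0.842

print(accuracy, precision, recall, f1)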

Implementing Evaluation in PyTorch

import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(model, dataloader):
    model.eval()
    all_preds, all_labels = [], []

    with torch.no_grad():
        for inputs, labels in dataloader:
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    acc = accuracy_score(all_labels, all_preds)
    prec = precision_score(all_labels, all_preds, average='weighted')
    rec = recall_score(all_labels, all_preds, average='weighted')
    f1 = f1_score(all_labels, all_preds, average='weighted')

    print(f"Accuracy: {acc:.4f}")
    print(f"Precision: {prec:.4f}")
    print(f"Recall: {rec:.4f}")
    print(f"F1 Score: {f1:.4f}")

This evaluation loop computes all major metrics over the test dataset.
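Usage is a single call once a test loader exists; here test_dataset is assumed to be built as in earlier chapters:

from torch.utils.data import DataLoader

test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
evaluate(model, test_loader)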


16.3 Confusion Matrix and ROC Analysis

Understanding how a model makes mistakes is as important as knowing how often it does.

Confusion Matrix

A confusion matrix shows how many predictions fall into each category — helping identify misclassifications.

Predicted \ Actual      Positive                  Negative
Positive                True Positive (TP)        False Positive (FP)
Negative                False Negative (FN)       True Negative (TN)

Example: Plotting a Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

y_true = [0, 1, 0, 1, 0, 1]
y_pred = [0, 0, 0, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.show()

This visualization provides an immediate sense of which classes are being confused with one another.


ROC (Receiver Operating Characteristic) Curve

The ROC curve illustrates the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR) at different thresholds.

  • TPR (Recall) = TP / (TP + FN)

  • FPR = FP / (FP + TN)

The Area Under the ROC Curve (AUC) measures the model’s ability to distinguish between classes:

  • AUC = 1.0: Perfect classifier

  • AUC = 0.5: Random guessing


Example: ROC Curve and AUC Score

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Example binary classification
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()

This plot helps evaluate how well the classifier separates positive and negative classes at various thresholds.


Interpreting ROC and AUC

AUC Range     Interpretation
0.9 – 1.0     Excellent
0.8 – 0.9     Good
0.7 – 0.8     Fair
0.6 – 0.7     Poor
0.5 – 0.6     Fail

16.4 Practical Example: Complete Evaluation Pipeline

import torch
from torch.utils.data import DataLoader
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Assume test_dataset and trained model exist
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

model.eval()
all_preds, all_labels, all_probs = [], [], []

with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
        probs = torch.softmax(outputs, dim=1)
        _, preds = torch.max(probs, 1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())
        all_probs.extend(probs[:, 1].cpu().numpy())  # For binary classification

# Evaluation metrics
print(classification_report(all_labels, all_preds))
print("ROC-AUC Score:", roc_auc_score(all_labels, all_probs))

This pipeline computes detailed metrics and ROC-AUC for a classification model.


16.5 Summary

  • Checkpointing ensures training progress is preserved in case of interruptions.

  • Model persistence enables reproducibility and deployment.

  • Performance evaluation using metrics like accuracy, precision, recall, and F1-score provides insight into model quality.

  • Confusion matrices and ROC curves give a deeper understanding of model behavior beyond simple accuracy.

  • Saving and loading models efficiently helps in resuming training, comparing experiments, and deploying models into production.


Exercises

  1. Conceptual Questions
    a. Explain the difference between saving the model and saving the state dictionary.
    b. Why is checkpointing important during model training?
    c. What does the F1-score represent, and why is it useful?
    d. What does the ROC curve indicate about a model’s performance?

  2. Practical Tasks
    a. Train a small neural network on MNIST or CIFAR-10. Implement periodic checkpointing every 2 epochs.
    b. Write a function to compute and plot a confusion matrix for your trained model.
    c. Compute ROC-AUC for a binary classification task using scikit-learn.
    d. Save the best model (based on validation accuracy) and load it later for testing.

  3. Advanced Challenge
    Implement a callback mechanism that automatically saves the model whenever the validation loss improves — similar to Keras’ ModelCheckpoint.
