Chapter 16: Model Evaluation, Saving, and Loading with PyTorch
Abstract:
Evaluating a trained PyTorch model follows a standard pattern:
- Set the model to evaluation mode: use model.eval() to disable dropout and batch-normalization updates, ensuring consistent behavior during inference.
- Disable gradient calculations: wrap the evaluation loop in torch.no_grad() to prevent unnecessary gradient computations, saving memory and speeding up the process.
- Iterate through the test or validation dataset: feed input data to the model and obtain predictions.
- Calculate relevant metrics: compare predictions with ground-truth labels to compute metrics like accuracy, precision, recall, F1-score, or loss.

import torch
import torch.nn as nn

# Assuming 'model', 'test_loader', and 'criterion' are defined
model.eval()  # Set model to evaluation mode

total_loss = 0
correct_predictions = 0
total_samples = 0

with torch.no_grad():  # Disable gradient calculations
    for inputs, labels in test_loader:
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        total_loss += loss.item()
        _, predicted = torch.max(outputs, 1)
        total_samples += labels.size(0)
        correct_predictions += (predicted == labels).sum().item()

average_loss = total_loss / len(test_loader)
accuracy = correct_predictions / total_samples
print(f"Test Loss: {average_loss:.4f}, Test Accuracy: {accuracy:.4f}")

To persist a trained model, save its state_dict, which contains the learned parameters (weights and biases):

# Assuming 'model' is your trained PyTorch model
PATH = "model_weights.pt"
torch.save(model.state_dict(), PATH)

To restore the model, instantiate the same architecture and load the saved state_dict into it:

# Assuming 'MyModelClass' is the class definition of your model
model = MyModelClass(*args, **kwargs)  # Instantiate the model with the same architecture
PATH = "model_weights.pt"
model.load_state_dict(torch.load(PATH))
model.eval()  # Set to evaluation mode for inference

To resume training later, save a checkpoint that bundles the optimizer's state_dict along with the model's state_dict and other relevant information such as the epoch number and loss:

# Saving a checkpoint
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pt')

# Loading a checkpoint
checkpoint = torch.load('checkpoint.pt')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
model.train()  # Set to training mode for continued training
Learning Objectives
After completing this chapter, you will be able to:
- Understand the importance of model evaluation and persistence in deep learning workflows.
- Apply techniques for checkpointing, saving, and loading PyTorch models.
- Evaluate model performance using common metrics like accuracy, precision, recall, and F1-score.
- Analyze model results through confusion matrices and ROC (Receiver Operating Characteristic) curves.
- Implement real-world evaluation pipelines using PyTorch and scikit-learn.
16.1 Checkpointing and Model Persistence
Model training can take hours or even days. To avoid losing progress, PyTorch provides convenient tools to save and load models and training checkpoints.
Saving and Loading Models
There are two primary methods for saving models in PyTorch:
- Saving the entire model
- Saving only the model state dictionary
1. Saving and Loading the Entire Model
import torch
# Save the entire model
torch.save(model, 'model.pth')
# Load the entire model
model = torch.load('model.pth')
model.eval()
⚠️ Note: Saving the full model pickles the entire module, architecture and weights together. The file is therefore tied to the exact class definitions and directory layout present at save time, which makes it less portable across environments.
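If the saved file needs to be loaded on a machine without a GPU (or on a different device than it was trained on), torch.load accepts a map_location argument. A minimal sketch, assuming model.pth was written by the snippet above on a CUDA device:

import torch

# Remap all tensors onto the CPU while loading the pickled model
model = torch.load('model.pth', map_location=torch.device('cpu'))
model.eval()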
2. Saving and Loading State Dictionaries (Preferred Way)
# Save only the model parameters (recommended)
torch.save(model.state_dict(), 'model_state.pth')
# Load model weights
model = MyModel() # must define the same model class
model.load_state_dict(torch.load('model_state.pth'))
model.eval()
Saving only the parameters keeps the file independent of how the class itself was pickled, which avoids many environment and refactoring issues; note that the model class definition must still be available when loading.
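As an extra safety measure when loading state dictionaries, newer PyTorch releases let you restrict torch.load to plain tensor data. A minimal sketch, assuming your installed version supports the weights_only flag:

# Refuse to unpickle arbitrary Python objects; only tensor data is loaded
state_dict = torch.load('model_state.pth', weights_only=True)
model = MyModel()  # same architecture as when the weights were saved
model.load_state_dict(state_dict)
model.eval()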
Checkpointing During Training
Checkpointing is useful for long training runs: it allows you to resume training after an interruption instead of starting from scratch.
# Save checkpoint
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss
}
torch.save(checkpoint, 'checkpoint.pth')
# Load checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
This approach saves both model parameters and optimizer states for exact continuation of training.
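To actually resume training from such a checkpoint, restore both state dictionaries and continue the loop from the saved epoch. A minimal sketch, assuming model, optimizer, criterion, train_loader, and num_epochs are defined exactly as in your original training script:

checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1  # continue from the next epoch

model.train()  # switch back to training mode
for epoch in range(start_epoch, num_epochs):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()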
Best Practices for Model Persistence
- Always save state_dict() instead of the full model.
- Keep training logs and epoch counts with checkpoints.
- Use versioning (e.g., model_v1.pth, model_v2.pth) for clarity (a helper that applies this is sketched after this list).
- Store checkpoints in cloud storage or on remote servers for safety.
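One way to apply these practices is a small helper that writes each checkpoint under a versioned filename. This is only a sketch; the checkpoints directory and the model_epoch_XXX.pth naming pattern are illustrative assumptions, not a PyTorch convention:

import os
import torch

def save_checkpoint(model, optimizer, epoch, loss, directory='checkpoints'):
    # Save a versioned checkpoint, e.g. checkpoints/model_epoch_003.pth
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f'model_epoch_{epoch:03d}.pth')
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, path)
    return path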
16.2 Performance Evaluation
Evaluating the model’s performance ensures that it generalizes well to unseen data.
Common Evaluation Metrics
| Metric | Description | Formula |
|---|---|---|
| Accuracy | Fraction of correct predictions | (TP + TN) / (TP + TN + FP + FN) |
| Precision | Fraction of predicted positives that are truly positive | TP / (TP + FP) |
| Recall (Sensitivity) | Fraction of actual positives identified | TP / (TP + FN) |
| F1-Score | Harmonic mean of precision and recall | 2 × (Precision × Recall) / (Precision + Recall) |
Where:
- TP: True Positives
- TN: True Negatives
- FP: False Positives
- FN: False Negatives
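To make these formulas concrete, here is a small from-scratch sketch that computes all four metrics for a binary problem without scikit-learn; the helper name binary_metrics and the toy labels are made up for illustration:

def binary_metrics(y_true, y_pred):
    # Count the four confusion-matrix cells (1 = positive class)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Toy example: TP=2, TN=2, FP=1, FN=1, so every metric works out to 2/3
print(binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))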
Implementing Evaluation in PyTorch
import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
def evaluate(model, dataloader):
    model.eval()  # disable dropout, use running BatchNorm statistics
    all_preds, all_labels = [], []
    with torch.no_grad():
        for inputs, labels in dataloader:
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    acc = accuracy_score(all_labels, all_preds)
    prec = precision_score(all_labels, all_preds, average='weighted')
    rec = recall_score(all_labels, all_preds, average='weighted')
    f1 = f1_score(all_labels, all_preds, average='weighted')
    print(f"Accuracy: {acc:.4f}")
    print(f"Precision: {prec:.4f}")
    print(f"Recall: {rec:.4f}")
    print(f"F1 Score: {f1:.4f}")
This evaluation loop computes all major metrics over the test dataset.
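A typical call, assuming a trained model and a test_dataset built as in earlier chapters (the batch size here is an arbitrary choice):

from torch.utils.data import DataLoader

test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
evaluate(model, test_loader)

If the model lives on a GPU, move each batch to the same device inside the loop (for example with inputs.to(device)) before the forward pass.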
16.3 Confusion Matrix and ROC Analysis
Understanding how a model makes mistakes is as important as knowing how often it does.
Confusion Matrix
A confusion matrix shows how many predictions fall into each category — helping identify misclassifications.
| Predicted \ Actual | Positive | Negative |
|---|---|---|
| Positive | True Positive (TP) | False Positive (FP) |
| Negative | False Negative (FN) | True Negative (TN) |
Example: Plotting a Confusion Matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
y_true = [0, 1, 0, 1, 0, 1]
y_pred = [0, 0, 0, 1, 1, 1]
cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.show()
This visualization provides an immediate sense of which classes are being confused with one another.
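When working with a trained classifier, the matrix is usually built from the predictions gathered during evaluation rather than from hand-written lists. A sketch, assuming all_labels and all_preds are the integer class indices collected in a loop like the one in Section 16.4, and that your scikit-learn version provides ConfusionMatrixDisplay.from_predictions:

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# all_labels / all_preds: ground-truth and predicted class indices from the evaluation loop
ConfusionMatrixDisplay.from_predictions(all_labels, all_preds, cmap=plt.cm.Blues)
plt.title('Confusion Matrix on the Test Set')
plt.show()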
ROC (Receiver Operating Characteristic) Curve
The ROC curve illustrates the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR) at different thresholds.
- TPR (Recall) = TP / (TP + FN)
- FPR = FP / (FP + TN)
The Area Under the ROC Curve (AUC) measures the model’s ability to distinguish between classes:
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random guessing
Example: ROC Curve and AUC Score
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
# Example binary classification
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()
This plot helps evaluate how well the classifier separates positive and negative classes at various thresholds.
Interpreting ROC and AUC
| AUC Range | Interpretation |
|---|---|
| 0.9 – 1.0 | Excellent |
| 0.8 – 0.9 | Good |
| 0.7 – 0.8 | Fair |
| 0.6 – 0.7 | Poor |
| 0.5 – 0.6 | Fail |
16.4 Practical Example: Complete Evaluation Pipeline
import torch
from torch.utils.data import DataLoader
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
# Assume test_dataset and trained model exist
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
model.eval()
all_preds, all_labels, all_probs = [], [], []
with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
        probs = torch.softmax(outputs, dim=1)
        _, preds = torch.max(probs, 1)
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())
        all_probs.extend(probs[:, 1].cpu().numpy())  # Probability of the positive class (binary classification)
# Evaluation metrics
print(classification_report(all_labels, all_preds))
print("ROC-AUC Score:", roc_auc_score(all_labels, all_probs))
This pipeline computes detailed metrics and ROC-AUC for a classification model.
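The roc_auc_score call above assumes a binary problem, where only the positive-class probability is needed. For more than two classes, scikit-learn can still compute a one-vs-rest ROC-AUC if you pass the full probability matrix. A sketch, assuming the loop also stored probs.cpu().numpy() for every batch in a hypothetical list all_prob_rows:

import numpy as np
from sklearn.metrics import roc_auc_score

all_probs_matrix = np.vstack(all_prob_rows)  # shape: (num_samples, num_classes)
auc_ovr = roc_auc_score(all_labels, all_probs_matrix, multi_class='ovr', average='weighted')
print("Multi-class ROC-AUC (one-vs-rest):", auc_ovr)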
16.5 Summary
- Checkpointing ensures training progress is preserved in case of interruptions.
- Model persistence enables reproducibility and deployment.
- Performance evaluation using metrics like accuracy, precision, recall, and F1-score provides insight into model quality.
- Confusion matrices and ROC curves give a deeper understanding of model behavior beyond simple accuracy.
- Saving and loading models efficiently helps in resuming training, comparing experiments, and deploying models into production.
Exercises
1. Conceptual Questions
a. Explain the difference between saving the entire model and saving only the state dictionary.
b. Why is checkpointing important during model training?
c. What does the F1-score represent, and why is it useful?
d. What does the ROC curve indicate about a model's performance?
2. Practical Tasks
a. Train a small neural network on MNIST or CIFAR-10. Implement periodic checkpointing every 2 epochs.
b. Write a function to compute and plot a confusion matrix for your trained model.
c. Compute ROC-AUC for a binary classification task using scikit-learn.
d. Save the best model (based on validation accuracy) and load it later for testing.
3. Advanced Challenge
Implement a callback mechanism that automatically saves the model whenever the validation loss improves, similar to Keras' ModelCheckpoint.