Appendix E: Common Errors & Debugging Tips in PyTorch
- Shape Mismatches: Occur during operations like matrix multiplication, concatenation, or view/reshape operations when tensor dimensions do not align.
  - Debugging Tip: Use tensor.shape or tensor.size() to inspect dimensions at various points in your code.
- RuntimeError: Trying to backward through the graph a second time: Happens when attempting to compute gradients through a graph that has already been freed or detached.
  - Debugging Tip: Ensure loss.backward() is called only once per computation graph. If you need to retain the graph for multiple backward calls, use loss.backward(retain_graph=True), but be mindful of memory usage. Alternatively, recompute the relevant operations if possible.
- In-place operation errors: Result from modifying a tensor in place when it is needed for gradient computation.
  - Debugging Tip: Avoid in-place operations (.add_(), .mul_(), etc.) on tensors that require gradients. Use out-of-place operations or clone() to create a copy before modifying.
- CUDA Out of Memory (OOM) Errors: Occur when the GPU's memory is exhausted.
  - Debugging Tip: Reduce batch size, use smaller models, or consider techniques like gradient accumulation (sketched in E.10). Utilize torch.cuda.empty_cache() to clear unused memory.
- NaN or inf in Gradients/Loss: Indicates numerical instability in the model.
  - Debugging Tip: Check for division by zero, log(0), or extremely large/small values. Use torch.autograd.set_detect_anomaly(True) to pinpoint the operation causing the NaNs. Consider gradient clipping.
- Incorrect model.train() and model.eval() Usage: Forgetting to switch between training and evaluation modes can lead to unexpected behavior, especially with layers like Dropout and BatchNorm.
  - Debugging Tip: Always call model.train() before training and model.eval() before evaluation/inference.

General debugging practices:

- Start Small: Test with a single batch or a very small dataset to ensure the basic training loop and model forward/backward passes are working.
- Print Statements: Use print(tensor.shape) and print(tensor.min(), tensor.max()) to inspect tensor values and shapes at critical points.
- PyTorch Profiler: Use torch.profiler to identify performance bottlenecks and memory usage.
- torch.autograd.gradcheck: Verify the correctness of custom autograd functions (see the sketch after this list).
- GPU vs. CPU: Ensure tensors are on the correct device (.to(device)) and that all necessary components (model, data, loss) are consistent in their device placement.
- Logging: Implement comprehensive logging to track model performance, loss values, and other relevant metrics throughout training.
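The gradcheck tip above is easy to try on a toy function. The example below is only a minimal sketch: f is a stand-in for whatever custom operation you want to verify, and gradcheck compares analytical gradients against finite differences (it expects double-precision inputs).

import torch
from torch.autograd import gradcheck

# Stand-in for a custom operation; replace with your own function or autograd.Function.apply.
def f(x):
    return (x * x).sum()

x = torch.randn(5, dtype=torch.double, requires_grad=True)   # gradcheck wants float64 inputs
print(gradcheck(f, (x,)))   # True when analytical and numerical gradients agree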
Debugging is an essential part of deep learning development. PyTorch offers flexibility and dynamic graphs, but users often encounter shape mismatches, device errors, gradient issues, and training instability. The sections that follow cover the most common problems in more detail, along with practical debugging strategies and recommended best practices.
E.1 Overview of Common PyTorch Errors
Below is a high-level classification of errors frequently encountered:
- Tensor shape and dimension mismatches
- Device mismatch (CPU vs GPU)
- Incorrect use of .item(), .detach(), or .numpy()
- Autograd-related issues
- DataLoader and batching errors
- Model not training or loss not decreasing
- Exploding/vanishing gradients
- Incorrect model saving or loading
- Memory issues (GPU out of memory)
- Deprecated or incorrect API usage
Each of these categories is explained with examples and solutions.
E.2 Shape and Dimension Errors
One of the most common obstacles in PyTorch is the dreaded shape mismatch.
E.2.1 Example Error
RuntimeError: Expected input[64, 3, 224, 224] to have the same size as ...
E.2.2 Common Causes
- Input size does not match model requirements
- Incorrect flattening or reshaping
- Mismatched number of classes
- Wrong feature map size after convolution layers
E.2.3 Debugging Tips
✔ Print shapes at every step
print(x.shape)
✔ Use torchsummary or torchinfo
from torchinfo import summary
summary(model, input_size=(1, 3, 224, 224))
✔ For linear layers after CNNs
Calculate flatten size:
print(x.view(x.size(0), -1).shape)
✔ Use assert for shape constraints
assert x.ndim == 4, "Input must be [B, C, H, W]"
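A related trick for sizing the first nn.Linear after a convolutional stack is to push a dummy tensor through the convolutional part and read off the flattened width. A minimal sketch; the features stack here is made up for illustration:

import torch
import torch.nn as nn

# Hypothetical convolutional stack; replace with the real feature extractor.
features = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

with torch.no_grad():
    dummy = torch.zeros(1, 3, 224, 224)            # one sample with the expected input shape
    n_flat = features(dummy).flatten(1).shape[1]   # flattened width per sample

fc = nn.Linear(n_flat, 10)                         # size the first linear layer from the measurement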
E.3 Device Mismatch Errors (CPU vs GPU)
E.3.1 Example Error
RuntimeError: Expected all tensors to be on the same device...
E.3.2 Causes
- Model on GPU but data on CPU
- Forgetting .to(device)
- Loss function or labels left on CPU
E.3.3 Fix
Standard pattern:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
for x, y in dataloader:
    x, y = x.to(device), y.to(device)
E.3.4 Debug Tip
Inspect devices:
print(x.device, model.fc.weight.device)
E.4 Issues with .item(), .detach(), .numpy()
E.4.1 Common Mistakes
❌ Calling .numpy() on GPU tensors
TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() first.
❌ Detaching tensors that still need gradients
loss.backward()  # parameters upstream of a detach() silently receive no gradients; a fully detached loss raises an error
❌ Using .item() on non-scalar tensors
E.4.2 Correct Usage
✔ Convert GPU tensor to NumPy
x.cpu().detach().numpy()
✔ Use .item() only for scalar values
loss_value = loss.item()
✔ For inference without gradients
with torch.no_grad():
    output = model(x)
E.5 Autograd Errors
E.5.1 Example
RuntimeError: element 0 of tensors does not require grad
Common Causes:
- You accidentally used .detach()
- You performed operations inside a with torch.no_grad() block
- Model parameters were not registered correctly
E.5.2 Debug Tips
✔ Check if parameters require grad
for name, param in model.named_parameters():
    print(name, param.requires_grad)
✔ Ensure layers are assigned as class attributes
# WRONG — not registered as a layer
layer = nn.Linear(10, 5)
# RIGHT
self.layer = nn.Linear(10, 5)
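For context, here is a minimal sketch of where that assignment belongs; the class name and layer sizes are made up for illustration:

import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 5)   # assigned to self, so it is registered as a submodule
        # hidden = nn.Linear(5, 2)      # a plain local variable would NOT be registered

    def forward(self, x):
        return self.layer(x)

model = TinyNet()
print(sum(p.numel() for p in model.parameters()))   # 55: only registered layers contribute parameters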
E.6 Dataloader & Batch Errors
E.6.1 Common Problems
- Wrong dataset return format
- Labels not being integers
- Incorrect transforms
- Collate function errors
- Batch dimension missing
E.6.2 Debug Tips
✔ Ensure dataset returns (input, label)
img, label = train_dataset[0]
✔ Confirm label types
print(type(label), label)
✔ Check batch shape
for batch in dataloader:
    x, y = batch
    print(x.shape, y.shape)
    break
✔ If using custom collate_fn
Test it manually with 2–3 items.
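For example (assuming a hypothetical my_collate function and an existing train_dataset):

# Pull a few raw samples and feed them to the collate function by hand.
samples = [train_dataset[i] for i in range(3)]
batch = my_collate(samples)      # hypothetical custom collate_fn

x, y = batch
print(x.shape, y.shape)          # verify the batch dimension and label types look right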
E.7 Model Not Training / Loss Not Decreasing
E.7.1 Causes
- Wrong learning rate
- Bad weight initialization
- Incorrect loss function
- Normalization missing
- Gradients exploding or vanishing
- Model too simple
- Data labeling errors
- Using softmax before CrossEntropyLoss (double softmax; see the example after the checklist below)
E.7.2 Quick Fix Checklist
✔ Use nn.CrossEntropyLoss() without softmax
✔ Try a lower learning rate (1e-3 → 1e-4)
✔ Check if model outputs correct shape
✔ Print some predictions
✔ Visualize data samples
✔ Verify labels and preprocessing
✔ Ensure shuffle=True in training DataLoader
✔ Use gradient clipping:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
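The double-softmax cause from the list above is worth a concrete illustration: nn.CrossEntropyLoss applies log-softmax internally, so the model should return raw logits. A minimal sketch:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 3, requires_grad=True)   # raw model outputs for 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 0])

# WRONG: softmax before CrossEntropyLoss squashes the logits and weakens gradients.
loss_wrong = criterion(torch.softmax(logits, dim=1), targets)

# RIGHT: pass raw logits; the loss applies log-softmax internally.
loss_right = criterion(logits, targets)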
E.8 Exploding / Vanishing Gradients
E.8.1 Symptoms
- Loss becomes NaN
- Model diverges
- Accuracy stuck
- Gradients extremely large or small
E.8.2 Debug Tips
✔ Check gradient stats
for p in model.parameters():
    if p.grad is not None:   # grads are None before the first backward pass
        print(p.grad.norm())
✔ Apply gradient clipping
from torch.nn.utils import clip_grad_norm_
clip_grad_norm_(model.parameters(), max_norm=5.0)
✔ Use better initialization
nn.init.xavier_uniform_(layer.weight)
✔ Use normalization layers
BatchNorm, LayerNorm
E.9 Model Saving & Loading Errors
E.9.1 Saving Errors
AttributeError: can't pickle local object
Fix
This usually appears when saving the whole model object (torch.save(model)) whose class was defined inside a function. Define model classes at the top level of a module, or save only the state_dict (see E.9.3).
E.9.2 Loading Errors
RuntimeError: size mismatch for layer.weight...
Fix
A size mismatch means the current model's layer shapes differ from those in the checkpoint, so make sure the architecture matches the one that was saved. If the checkpoint only has missing or unexpected keys (not mismatched shapes), the overlapping weights can be loaded with:
model.load_state_dict(torch.load("model.pth"), strict=False)
E.9.3 Correct Save/Load Pattern
Save:
torch.save(model.state_dict(), "model.pth")
Load:
model = MyModel()
model.load_state_dict(torch.load("model.pth"))
model.eval()
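If training may be resumed later, it is also common to checkpoint the optimizer state and epoch counter. A minimal sketch; the file name and dictionary keys are arbitrary:

# Save a resumable checkpoint.
torch.save({
    "epoch": epoch,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}, "checkpoint.pth")

# Resume later.
ckpt = torch.load("checkpoint.pth")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
start_epoch = ckpt["epoch"] + 1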
E.10 Out-of-Memory (OOM) Errors
E.10.1 Common Causes
- Batch size too large
- Too many workers in DataLoader
- No with torch.no_grad() during evaluation
- Storing tensors accidentally
E.10.2 Solutions
✔ Reduce batch size
✔ Use mixed precision (AMP)
with torch.cuda.amp.autocast():
    output = model(x)
✔ Clear cache
torch.cuda.empty_cache()
✔ Don’t store graph-attached tensors in lists
Appending loss or outputs to a Python list keeps their whole computation graphs alive; store loss.item() or call .detach() first.
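Gradient accumulation, mentioned in the quick-reference list at the start of this appendix, is another memory saver: it preserves the effective batch size while feeding smaller micro-batches. A minimal sketch; accum_steps is chosen arbitrarily:

accum_steps = 4                      # effective batch = accum_steps * dataloader batch size
optimizer.zero_grad()
for i, (x, y) in enumerate(dataloader):
    x, y = x.to(device), y.to(device)
    loss = criterion(model(x), y) / accum_steps   # scale so gradients average correctly
    loss.backward()                               # gradients accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()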
E.11 Deprecated API Errors
PyTorch evolves rapidly; older tutorials may use deprecated functions.
Examples:
| Deprecated | Updated |
|---|---|
| Variable() | Just use tensors |
| F.sigmoid | torch.sigmoid |
| view(-1) for flattening | nn.Flatten() |
| volatile=True | Use torch.no_grad() |
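The updated idioms from the table, shown together in a brief sketch:

import torch
import torch.nn as nn

x = torch.randn(8, 3, 32, 32, requires_grad=True)   # plain tensors replace Variable()
probs = torch.sigmoid(x)                             # instead of F.sigmoid
flat = nn.Flatten()(x)                               # instead of x.view(-1) for flattening
with torch.no_grad():                                # instead of volatile=True
    y = flat * 2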
E.12 Debugging Tools in PyTorch
E.12.1 torch.autograd.set_detect_anomaly(True)
Useful for tracking the source of NaN or invalid backward passes.
torch.autograd.set_detect_anomaly(True)
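Anomaly detection can also be scoped to a single suspicious step with the equivalent context manager:

with torch.autograd.detect_anomaly():
    output = model(x)
    loss = criterion(output, y)
    loss.backward()   # errors now include the forward trace of the op that produced the NaN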
E.12.2 torchviz (for graph visualization)
pip install torchviz
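A typical usage sketch (the output file name is arbitrary):

from torchviz import make_dot

output = model(x)
# Writes graph.png showing the autograd graph that produced `output`.
make_dot(output, params=dict(model.named_parameters())).render("graph", format="png")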
E.12.3 TensorBoard
Monitor loss, gradients, histograms:
tensorboard --logdir runs
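On the PyTorch side, metrics are written with SummaryWriter. A minimal sketch that logs the training loss per step (backward and optimizer steps omitted for brevity):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/experiment1")   # log directory; name is arbitrary
for step, (x, y) in enumerate(dataloader):
    loss = criterion(model(x), y)
    writer.add_scalar("Loss/train", loss.item(), step)
writer.close()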
E.12.4 pdb (Python Debugger)
Insert at any point:
import pdb; pdb.set_trace()
E.13 Summary
This appendix covered the most common PyTorch issues, including:
- Shape and dimension mismatches
- Device and tensor type errors
- Autograd and gradient problems
- DataLoader and batching issues
- Loss not decreasing or unstable training
- Model saving/loading challenges
- GPU memory limitations
- Deprecated API pitfalls
By combining debugging strategies, assertions, visualization, and PyTorch’s built-in tools, developers can resolve errors quickly and maintain cleaner, more reliable deep learning code.