Appendix E: Common Errors & Debugging Tips in PyTorch
- Shape Mismatches: Occur during operations like matrix multiplication, concatenation, or view/reshape operations when tensor dimensions do not align.
  - Debugging Tip: Use tensor.shape or tensor.size() to inspect dimensions at various points in your code.
- RuntimeError: Trying to backward through the graph a second time: Happens when attempting to compute gradients through a graph that has already been freed or detached.
  - Debugging Tip: Ensure loss.backward() is called only once per computation graph. If you need to retain the graph for multiple backward calls, use loss.backward(retain_graph=True), but be mindful of memory usage. Alternatively, recompute the relevant operations if possible.
- In-place operation errors: Result from modifying a tensor in place when it is needed for gradient computation.
  - Debugging Tip: Avoid in-place operations (.add_(), .mul_(), etc.) on tensors that require gradients. Use out-of-place operations or clone() to create a copy before modifying.
- CUDA Out of Memory (OOM) Errors: Occur when the GPU's memory is exhausted.
  - Debugging Tip: Reduce batch size, use smaller models, or consider techniques like gradient accumulation (sketched in E.10). Utilize torch.cuda.empty_cache() to clear unused memory.
- NaN or inf in Gradients/Loss: Indicates numerical instability in the model.
  - Debugging Tip: Check for division by zero, log(0), or extremely large/small values. Use torch.autograd.set_detect_anomaly(True) to pinpoint the operation causing the NaNs. Consider gradient clipping.
- Incorrect model.train() and model.eval() Usage: Forgetting to switch between training and evaluation modes can lead to unexpected behavior, especially with layers like Dropout and BatchNorm.
  - Debugging Tip: Always call model.train() before training and model.eval() before evaluation/inference.

General debugging practices:

- Start Small: Test with a single batch or a very small dataset to ensure the basic training loop and model forward/backward passes are working.
- Print Statements: Use print(tensor.shape) and print(tensor.min(), tensor.max()) to inspect tensor values and shapes at critical points.
- PyTorch Profiler: Use torch.profiler to identify performance bottlenecks and memory usage.
- torch.autograd.gradcheck: Verify the correctness of custom autograd functions (see the sketch after this list).
- GPU vs. CPU: Ensure tensors are on the correct device (.to(device)) and that all necessary components (model, data, loss) are consistent in their device placement.
- Logging: Implement comprehensive logging to track model performance, loss values, and other relevant metrics throughout training.
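The gradcheck tip above is easy to try on a toy function. The example below is only a minimal sketch: f is a stand-in for whatever custom operation you want to verify, and gradcheck compares analytical gradients against finite differences (it expects double-precision inputs).

import torch
from torch.autograd import gradcheck

# Stand-in for a custom operation; replace with your own function or autograd.Function.apply.
def f(x):
    return (x * x).sum()

x = torch.randn(5, dtype=torch.double, requires_grad=True)   # gradcheck wants float64 inputs
print(gradcheck(f, (x,)))   # True when analytical and numerical gradients agree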
Debugging is an essential part of deep learning development. PyTorch offers flexibility and dynamic graphs, but users often encounter shape mismatches, device errors, gradient issues, and training instability. The sections that follow cover the most common problems in more detail, along with practical debugging strategies and recommended best practices.
E.1 Overview of Common PyTorch Errors
Below is a high-level classification of errors frequently encountered:
- Tensor shape and dimension mismatches
- Device mismatch (CPU vs GPU)
- Incorrect use of .item(), .detach(), or .numpy()
- Autograd-related issues
- DataLoader and batching errors
- Model not training or loss not decreasing
- Exploding/vanishing gradients
- Incorrect model saving or loading
- Memory issues (GPU out of memory)
- Deprecated or incorrect API usage
Each of these categories is explained with examples and solutions.
E.2 Shape and Dimension Errors
One of the most common obstacles in PyTorch is the dreaded shape mismatch.
E.2.1 Example Error
RuntimeError: Expected input[64, 3, 224, 224] to have the same size as ...
E.2.2 Common Causes
- Input size does not match model requirements
- Incorrect flattening or reshaping
- Mismatched number of classes
- Wrong feature map size after convolution layers
E.2.3 Debugging Tips
✔ Print shapes at every step
print(x.shape)
✔ Use torchsummary or torchinfo
from torchinfo import summary
summary(model, input_size=(1, 3, 224, 224))
✔ For linear layers after CNNs
Calculate flatten size:
print(x.view(x.size(0), -1).shape)
✔ Use assert for shape constraints
assert x.ndim == 4, "Input must be [B, C, H, W]"
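A related trick for sizing the first nn.Linear after a convolutional stack is to push a dummy tensor through the convolutional part and read off the flattened width. A minimal sketch; the features stack here is made up for illustration:

import torch
import torch.nn as nn

# Hypothetical convolutional stack; replace with the real feature extractor.
features = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

with torch.no_grad():
    dummy = torch.zeros(1, 3, 224, 224)            # one sample with the expected input shape
    n_flat = features(dummy).flatten(1).shape[1]   # flattened width per sample

fc = nn.Linear(n_flat, 10)                         # size the first linear layer from the measurement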
E.3 Device Mismatch Errors (CPU vs GPU)
E.3.1 Example Error
RuntimeError: Expected all tensors to be on the same device...
E.3.2 Causes
- Model on GPU but data on CPU
- Forgetting .to(device)
- Loss function or labels left on CPU
E.3.3 Fix
Standard pattern:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
for x, y in dataloader:
    x, y = x.to(device), y.to(device)
E.3.4 Debug Tip
Inspect devices:
print(x.device, model.fc.weight.device)
E.4 Issues with .item(), .detach(), .numpy()
E.4.1 Common Mistakes
❌ Calling .numpy() on GPU tensors
TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() first.
❌ Detaching tensors that still need gradients
loss.backward()  # parameters upstream of a detach() silently receive no gradients; a fully detached loss raises an error
❌ Using .item() on non-scalar tensors
E.4.2 Correct Usage
✔ Convert GPU tensor to NumPy
x.cpu().detach().numpy()
✔ Use .item() only for scalar values
loss_value = loss.item()
✔ For inference without gradients
with torch.no_grad():
    output = model(x)
E.5 Autograd Errors
E.5.1 Example
RuntimeError: element 0 of tensors does not require grad
Common Causes:
- You accidentally used .detach()
- You performed operations inside a with torch.no_grad() block
- Model parameters were not registered correctly
E.5.2 Debug Tips
✔ Check if parameters require grad
for name, param in model.named_parameters():
    print(name, param.requires_grad)
✔ Ensure layers are assigned as class attributes
# WRONG — not registered as a layer
layer = nn.Linear(10, 5)
# RIGHT
self.layer = nn.Linear(10, 5)
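For context, here is a minimal sketch of where that assignment belongs; the class name and layer sizes are made up for illustration:

import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 5)   # assigned to self, so it is registered as a submodule
        # hidden = nn.Linear(5, 2)      # a plain local variable would NOT be registered

    def forward(self, x):
        return self.layer(x)

model = TinyNet()
print(sum(p.numel() for p in model.parameters()))   # 55: only registered layers contribute parameters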
E.6 Dataloader & Batch Errors
E.6.1 Common Problems
- Wrong dataset return format
- Labels not being integers
- Incorrect transforms
- Collate function errors
- Batch dimension missing
E.6.2 Debug Tips
✔ Ensure dataset returns (input, label)
img, label = train_dataset[0]
✔ Confirm label types
print(type(label), label)
✔ Check batch shape
for batch in dataloader:
    x, y = batch
    print(x.shape, y.shape)
    break
✔ If using custom collate_fn
Test it manually with 2–3 items.
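For example (assuming a hypothetical my_collate function and an existing train_dataset):

# Pull a few raw samples and feed them to the collate function by hand.
samples = [train_dataset[i] for i in range(3)]
batch = my_collate(samples)      # hypothetical custom collate_fn

x, y = batch
print(x.shape, y.shape)          # verify the batch dimension and label types look right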
E.7 Model Not Training / Loss Not Decreasing
E.7.1 Causes
- Wrong learning rate
- Bad weight initialization
- Incorrect loss function
- Normalization missing
- Gradients exploding or vanishing
- Model too simple
- Data labeling errors
- Using softmax before CrossEntropyLoss (double softmax; see the example after the checklist below)
E.7.2 Quick Fix Checklist
✔ Use nn.CrossEntropyLoss() without softmax
✔ Try a lower learning rate (1e-3 → 1e-4)
✔ Check if model outputs correct shape
✔ Print some predictions
✔ Visualize data samples
✔ Verify labels and preprocessing
✔ Ensure shuffle=True in training DataLoader
✔ Use gradient clipping:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
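The double-softmax cause from the list above is worth a concrete illustration: nn.CrossEntropyLoss applies log-softmax internally, so the model should return raw logits. A minimal sketch:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 3, requires_grad=True)   # raw model outputs for 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 0])

# WRONG: softmax before CrossEntropyLoss squashes the logits and weakens gradients.
loss_wrong = criterion(torch.softmax(logits, dim=1), targets)

# RIGHT: pass raw logits; the loss applies log-softmax internally.
loss_right = criterion(logits, targets)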
E.8 Exploding / Vanishing Gradients
E.8.1 Symptoms
- Loss becomes NaN
- Model diverges
- Accuracy stuck
- Gradients extremely large or small
E.8.2 Debug Tips
✔ Check gradient stats
for p in model.parameters():
    if p.grad is not None:   # grads are None before the first backward pass
        print(p.grad.norm())
✔ Apply gradient clipping
from torch.nn.utils import clip_grad_norm_
clip_grad_norm_(model.parameters(), max_norm=5.0)
✔ Use better initialization
nn.init.xavier_uniform_(layer.weight)
✔ Use normalization layers
BatchNorm, LayerNorm
E.9 Model Saving & Loading Errors
E.9.1 Saving Errors
AttributeError: can't pickle local object
Fix
This usually appears when saving the whole model object (torch.save(model)) whose class was defined inside a function. Define model classes at the top level of a module, or save only the state_dict (see E.9.3).
E.9.2 Loading Errors
RuntimeError: size mismatch for layer.weight...
Fix
A size mismatch means the current model's layer shapes differ from those in the checkpoint, so make sure the architecture matches the one that was saved. If the checkpoint only has missing or unexpected keys (not mismatched shapes), the overlapping weights can be loaded with:
model.load_state_dict(torch.load("model.pth"), strict=False)
E.9.3 Correct Save/Load Pattern
Save:
torch.save(model.state_dict(), "model.pth")
Load:
model = MyModel()
model.load_state_dict(torch.load("model.pth"))
model.eval()
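If training may be resumed later, it is also common to checkpoint the optimizer state and epoch counter. A minimal sketch; the file name and dictionary keys are arbitrary:

# Save a resumable checkpoint.
torch.save({
    "epoch": epoch,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
}, "checkpoint.pth")

# Resume later.
ckpt = torch.load("checkpoint.pth")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
start_epoch = ckpt["epoch"] + 1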
E.10 Out-of-Memory (OOM) Errors
E.10.1 Common Causes
- Batch size too large
- Too many workers in DataLoader
- No with torch.no_grad() during evaluation
- Storing tensors accidentally
E.10.2 Solutions
✔ Reduce batch size
✔ Use mixed precision (AMP)
with torch.cuda.amp.autocast():
    output = model(x)
✔ Clear cache
torch.cuda.empty_cache()
✔ Don’t store graph-attached tensors in lists
Appending loss or outputs to a Python list keeps their whole computation graphs alive; store loss.item() or call .detach() first.
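Gradient accumulation, mentioned in the quick-reference list at the start of this appendix, is another memory saver: it preserves the effective batch size while feeding smaller micro-batches. A minimal sketch; accum_steps is chosen arbitrarily:

accum_steps = 4                      # effective batch = accum_steps * dataloader batch size
optimizer.zero_grad()
for i, (x, y) in enumerate(dataloader):
    x, y = x.to(device), y.to(device)
    loss = criterion(model(x), y) / accum_steps   # scale so gradients average correctly
    loss.backward()                               # gradients accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()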
E.11 Deprecated API Errors
PyTorch evolves rapidly; older tutorials may use deprecated functions.
Examples:
| Deprecated | Updated |
|---|---|
| Variable() | Just use tensors |
| F.sigmoid | torch.sigmoid |
| view(-1) for flattening | nn.Flatten() |
| volatile=True | Use torch.no_grad() |
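The updated idioms from the table, shown together in a brief sketch:

import torch
import torch.nn as nn

x = torch.randn(8, 3, 32, 32, requires_grad=True)   # plain tensors replace Variable()
probs = torch.sigmoid(x)                             # instead of F.sigmoid
flat = nn.Flatten()(x)                               # instead of x.view(-1) for flattening
with torch.no_grad():                                # instead of volatile=True
    y = flat * 2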
E.12 Debugging Tools in PyTorch
E.12.1 torch.autograd.set_detect_anomaly(True)
Useful for tracking the source of NaN or invalid backward passes.
torch.autograd.set_detect_anomaly(True)
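Anomaly detection can also be scoped to a single suspicious step with the equivalent context manager:

with torch.autograd.detect_anomaly():
    output = model(x)
    loss = criterion(output, y)
    loss.backward()   # errors now include the forward trace of the op that produced the NaN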
E.12.2 torchviz (for graph visualization)
pip install torchviz
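A typical usage sketch (the output file name is arbitrary):

from torchviz import make_dot

output = model(x)
# Writes graph.png showing the autograd graph that produced `output`.
make_dot(output, params=dict(model.named_parameters())).render("graph", format="png")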
E.12.3 TensorBoard
Monitor loss, gradients, histograms:
tensorboard --logdir runs
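On the PyTorch side, metrics are written with SummaryWriter. A minimal sketch that logs the training loss per step (backward and optimizer steps omitted for brevity):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/experiment1")   # log directory; name is arbitrary
for step, (x, y) in enumerate(dataloader):
    loss = criterion(model(x), y)
    writer.add_scalar("Loss/train", loss.item(), step)
writer.close()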
E.12.4 pdb (Python Debugger)
Insert at any point:
import pdb; pdb.set_trace()
E.13 Summary
This appendix covered the most common PyTorch issues, including:
- Shape and dimension mismatches
- Device and tensor type errors
- Autograd and gradient problems
- DataLoader and batching issues
- Loss not decreasing or unstable training
- Model saving/loading challenges
- GPU memory limitations
- Deprecated API pitfalls
By combining debugging strategies, assertions, visualization, and PyTorch’s built-in tools, developers can resolve errors quickly and maintain cleaner, more reliable deep learning code.