Appendix G: Troubleshooting and Debugging in PyTorch

Abstract:

Troubleshooting and debugging in PyTorch involves identifying and resolving issues that arise during model development, training, and deployment. This can encompass a range of problems, from incorrect model behavior and performance bottlenecks to memory errors and unexpected numerical instability.
Common Troubleshooting Areas:
  • Data Issues:
    • Incorrect data loading or preprocessing: Verify dataset integrity, transformations, and batching.
    • Data starvation: Use tools like nvidia-smi to monitor GPU utilization and identify if the data loader is a bottleneck.
  • Model Issues:
    • Incorrect model architecture or layer implementation: Carefully review the nn.Module definitions and ensure correct parameter handling (e.g., using nn.ModuleList for lists of modules).
    • Weight initialization problems: Investigate the impact of different initialization schemes.
    • Gradient issues: Check for exploding or vanishing gradients (e.g., by logging gradient norms or using torch.nn.utils.clip_grad_norm_).
  • Training Issues:
    • Unstable training: Look for NaN or inf values in loss or gradients, which can indicate numerical instability.
    • Incorrect optimizer or learning rate scheduler configuration.
    • Memory errors (OOM): Reduce batch size, use gradient accumulation, or consider mixed-precision training.
Debugging Techniques and Tools:
  • Print Statements and Logging: 
    Insert print() statements at various points in your code to inspect tensor shapes, values, and intermediate results. Use Python's logging module for more structured output.
  • Python Debuggers (e.g., PDB, VS Code Debugger):
    • import pdb; pdb.set_trace(): Insert this line to set a breakpoint and step through the code, inspecting variables.
    • Integrated IDE Debuggers: Utilize debuggers in IDEs like VS Code for a more visual and interactive debugging experience, including setting breakpoints, watching variables, and stepping through code.
  • PyTorch Profiler:
    • Use torch.profiler.profile to analyze runtime performance, memory usage, and identify bottlenecks in your code, including CPU and GPU operations.
  • torch.autograd.set_detect_anomaly(True):
    • Enable anomaly detection in the autograd engine to catch operations that produce NaN or inf values during backpropagation.
  • Reduced Reproducible Script (R2S):
    • When encountering complex bugs, reduce your code to a minimal, self-contained script that reproduces the issue. This helps isolate the problem and makes it easier to share for assistance.
  • PyTorch/XLA Debugging Tools (for XLA devices):
    • Utilize environment variables like PT_XLA_DEBUG_LEVEL, XLA_IR_DEBUG, and XLA_SAVE_TENSORS_FILE to gain insights into XLA compilation and execution.
By systematically applying these techniques and tools, you can effectively diagnose and resolve issues in your PyTorch models and workflows.


Troubleshooting and debugging are essential skills when building deep learning models. PyTorch provides flexible tools, but debugging issues related to shape mismatches, device errors, exploding gradients, and training instability can be challenging. This appendix provides a complete guide to identifying, diagnosing, and resolving common PyTorch problems.


1. Common Errors and How to Fix Them

1.1 Shape Mismatch Errors

Shape mismatches occur during:

  • Matrix multiplications

  • Loss calculations

  • Concatenations

  • Layer input/output processing

Common Error Message

RuntimeError: mat1 and mat2 shapes cannot be multiplied

How to Fix

  1. Print tensor shapes:

print(x.shape, w.shape)

  2. Use torch.flatten() or x.view() to reshape:

x = x.view(x.size(0), -1)

  3. Ensure the model's layers match the input size (a short sketch follows this list).
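
For illustration, here is a minimal sketch of the typical case: a convolutional feature map must be flattened before it reaches a fully connected layer. The tensor sizes are made up for the example.

import torch
import torch.nn as nn

x = torch.randn(8, 16, 4, 4)        # batch of 8 feature maps of size 16 x 4 x 4
fc = nn.Linear(16 * 4 * 4, 10)      # expects 256 input features per sample

# fc(x) would raise a shape-mismatch error; flatten everything but the batch dim
x = x.view(x.size(0), -1)           # shape: (8, 256)
out = fc(x)                         # shape: (8, 10)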


1.2 Device Mismatch Errors (CPU vs GPU)

Error Message

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Fix

Move all tensors and model to the same device:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
x = x.to(device)
y = y.to(device)

1.3 Autograd Errors

Error Message

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Fix

Enable gradients:

x = torch.tensor(data, dtype=torch.float32, requires_grad=True)  # requires_grad only works on floating-point tensors

Avoid operations inside torch.no_grad() unless intentional.


1.4 Dataloader Issues

Error Message

RuntimeError: stack expects each tensor to be equal size

Fix

  • Ensure all samples have consistent dimensions.

  • For variable-sized inputs (e.g., text, audio), use a custom collate_fn (see the sketch after this list).
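
As a sketch, the following collate_fn pads variable-length 1-D tensors to a common length so they can be stacked into a batch; the toy dataset and padding value are assumptions made for the example.

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    # batch is a list of (sequence, label) pairs with varying sequence lengths
    sequences, labels = zip(*batch)
    padded = pad_sequence(sequences, batch_first=True, padding_value=0)
    return padded, torch.tensor(labels)

# Hypothetical dataset of variable-length sequences
dataset = [(torch.randn(n), 0) for n in (5, 8, 3)]
loader = DataLoader(dataset, batch_size=3, collate_fn=pad_collate)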


1.5 Loss Does Not Decrease

Possible Reasons

  • Wrong learning rate

  • Incorrect model architecture

  • Bad data preprocessing

  • Vanishing/exploding gradients

Fixes

  • Lower or raise the learning rate:

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
  • Normalize input data.

  • Debug using gradient inspection (see the sketch after this list).
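
A minimal sketch of gradient inspection: after loss.backward(), log the gradient norm of each parameter to see whether gradients are vanishing, exploding, or missing entirely. The loss and model names are placeholders from a hypothetical training loop.

loss.backward()

for name, param in model.named_parameters():
    if param.grad is None:
        print(f"{name}: no gradient (check requires_grad and the graph)")
    else:
        print(f"{name}: grad norm = {param.grad.norm().item():.4e}")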


2. Debugging Tools in PyTorch


2.1 Printing Model Summary

# Requires the third-party torchsummary package (pip install torchsummary)
from torchsummary import summary
summary(model, input_size=(3, 224, 224))

2.2 Using Hooks to Inspect Activations

Hooks help inspect:

  • Layer inputs

  • Layer outputs

  • Gradients

Forward Hook

def forward_hook(module, input, output):
    print(module, output.shape)

layer = model.conv1
layer.register_forward_hook(forward_hook)

Backward Hook

def backward_hook(module, grad_in, grad_out):
    print(module, [g.shape for g in grad_out if g is not None])

# register_backward_hook is deprecated; prefer the full backward hook
layer.register_full_backward_hook(backward_hook)

2.3 Using torch.autograd.gradcheck()

Useful for numerically verifying gradients of custom layers or autograd functions:

# gradcheck requires double precision and inputs that require grad
torch.autograd.gradcheck(model.double(), input_tensor.double().requires_grad_(True))

2.4 Using PyTorch Profiler

import torch.profiler as profiler

with profiler.profile(record_shapes=True) as prof:
    output = model(x)

print(prof.key_averages().table(sort_by="cpu_time_total"))

2.5 Using TensorBoard for Debugging

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
writer.add_graph(model, x)
writer.close()
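
Beyond the model graph, TensorBoard is useful for tracking scalars over time. A sketch of logging the loss and a global gradient norm each step (step, loss, and model come from a hypothetical training loop):

writer.add_scalar("train/loss", loss.item(), global_step=step)
total_norm = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None])
)
writer.add_scalar("train/grad_norm", total_norm.item(), global_step=step)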

3. Debugging Training Instability


3.1 Exploding Gradients

Fix: Gradient Clipping

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
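
Clipping must happen after loss.backward() and before optimizer.step(). A sketch of the relevant slice of a training loop (model, optimizer, criterion, x, and y are assumed to exist):

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()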

3.2 Vanishing Gradients

Fixes

  • Switch to ReLU, LeakyReLU, or GELU

  • Use Residual Connections

  • Use Batch Normalization (a residual-block sketch combining these fixes follows below)
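
A minimal residual block combining these ideas (ReLU, a skip connection, and batch normalization); the channel count is arbitrary for the example:

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection keeps gradients flowing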


3.3 Overfitting

Fixes

  • Add Dropout (see the sketch below)

  • Add Data Augmentation

  • Use Weight Decay:

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
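
For example, Dropout is added as a layer in the model definition; the layer sizes and probability of 0.5 are arbitrary choices for this sketch:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes activations during training only
    nn.Linear(128, 10),
)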

3.4 Underfitting

Fixes

  • Increase model complexity

  • Train for more epochs

  • Reduce regularization


4. Checklist for Diagnosing Errors

✔ Check tensor shapes

✔ Check device (CPU/GPU)

✔ Check data type (float32, long, etc.)

✔ Check model layers and outputs

✔ Check loss function expectations

✔ Visualize gradients

✔ Test training on small subset

✔ Look for NaNs or inf values


5. Debugging NaN or Inf Values

5.1 Detect

if torch.isnan(x).any():
    print("NaN detected!")

5.2 Fix

  • Lower the learning rate

  • Use loss scaling for FP16 / mixed-precision training (a fuller sketch follows this list):

from torch.cuda.amp import GradScaler
scaler = GradScaler()
  • Normalize inputs
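
Expanding on the loss-scaling bullet above, a sketch of a single mixed-precision training step with autocast and GradScaler (model, optimizer, criterion, x, and y are assumed to exist):

from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

optimizer.zero_grad()
with autocast():
    loss = criterion(model(x), y)
scaler.scale(loss).backward()   # scaled loss avoids FP16 gradient underflow
scaler.step(optimizer)          # unscales gradients before stepping the optimizer
scaler.update()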


6. Tips for Systematic Debugging

6.1 Train on a Small Batch

Helps isolate issues:

subset = next(iter(train_loader))
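
Building on the snippet above, a common sanity check is to try to overfit that single batch: if the loss does not fall toward zero, something is wrong with the model, loss, or optimizer. A minimal sketch (model, optimizer, and criterion are assumed to exist, and the loader is assumed to yield (inputs, targets) pairs):

x, y = subset

model.train()
for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step}: loss = {loss.item():.4f}")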

6.2 Zero Out Weights to Test Flow

for p in model.parameters():
    torch.nn.init.constant_(p, 0)

6.3 Test Forward Pass Only

with torch.no_grad():
    output = model(x)

6.4 Add Assertions

assert x.shape[1] == 3, "Input must have 3 channels"

7. Frequently Encountered PyTorch Bugs (Practical List)

Issue                  Likely Cause             Fix
Loss = NaN             LR too high              Lower LR
GPU out of memory      Large batch size         Reduce batch size
Accuracy stuck         Wrong labels             Check dataset / transforms
Slow training          DataLoader bottleneck    Set num_workers > 0
Model not improving    No gradient flow         Check .requires_grad

8. When to Use Debuggers

PDB

import pdb; pdb.set_trace()

VSCode / PyCharm Debugger

Step through training loops.

PyTorch anomaly detection

with torch.autograd.set_detect_anomaly(True):
    loss.backward()

9. Best Practices for Smooth Debugging

  • Write modular code

  • Validate each component separately

  • Save intermediate outputs to inspect

  • Use unit tests for custom layers

  • Log everything (loss, gradients, parameters)

  • Keep experiments reproducible via seeds:

torch.manual_seed(42)
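
For fuller reproducibility, the Python, NumPy, and CUDA random generators can be seeded as well. A sketch (exact determinism on GPU may additionally require the cuDNN settings shown):

import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optional: trade speed for determinism in cuDNN
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False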

Conclusion

Troubleshooting and debugging are crucial when using PyTorch for deep learning. This appendix provided a comprehensive guide covering:

  • Common errors

  • Debugging tools

  • Gradient and performance debugging

  • Best practices

  • Practical solutions to real-world issues

With these strategies, developers can diagnose and fix PyTorch problems efficiently, leading to faster experimentation and more reliable model development.


