Annexure 6: PyTorch Troubleshooting and Error-Handling Guide


This annexure provides a comprehensive reference to common PyTorch errors, their causes, and step-by-step solutions. It is designed to help learners, researchers, and developers debug PyTorch code effectively and avoid recurring mistakes.


1. Introduction

Although PyTorch is a flexible and developer-friendly deep-learning framework, beginners and advanced users alike frequently encounter errors, especially ones related to tensors, shapes, gradients, CUDA, and data handling.

This annexure covers:

  • Common error messages

  • Likely root causes

  • How to fix them

  • Preventive practices

  • Debugging techniques

  • Tools inside and outside PyTorch for diagnosing issues


2. Common PyTorch Errors and Their Solutions


2.1 Shape Mismatch Errors

Error Example

RuntimeError: mat1 and mat2 shapes cannot be multiplied (32x128 and 64x10)

Cause

  • The input tensor shape does not match the expected layer shape.

  • Using the wrong view() or reshape() dimensions.

  • Incorrect flattening before a fully connected layer.

Solution

  1. Print tensor shapes at key steps:

    print(x.shape)
    
  2. Correct the Linear layer input size.

  3. Use nn.Flatten() or:

    x = x.view(x.size(0), -1)
    

Prevention

  • Always verify the flattened feature size with a dummy forward pass, as sketched below.
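
A minimal sketch of that check, assuming a hypothetical convolutional backbone named features:

import torch
import torch.nn as nn

features = nn.Sequential(                         # hypothetical backbone
    nn.Conv2d(3, 16, 3), nn.ReLU(), nn.MaxPool2d(2)
)
dummy = torch.zeros(1, 3, 224, 224)               # one fake input with the real image size
n_features = features(dummy).flatten(1).shape[1]  # flattened feature count
fc = nn.Linear(n_features, 10)                    # Linear input size now matches exactly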


2.2 CUDA and Device Errors

Error Example

RuntimeError: Expected all tensors to be on the same device, but got CPU and CUDA tensors

Cause

  • Mixing CPU and GPU tensors.

  • Forgetting to move the model or data to CUDA.

Solution

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs, labels = inputs.to(device), labels.to(device)

Prevention

  • Define a to_device() helper function (see the sketch after this list).

  • Wrap model training inside a standard template.
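
One possible to_device() helper, which moves tensors (including ones nested in lists or tuples) to the target device:

import torch

def to_device(data, device):
    # Recursively move tensors, lists, and tuples of tensors to the device
    if isinstance(data, (list, tuple)):
        return type(data)(to_device(x, device) for x in data)
    return data.to(device, non_blocking=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
inputs, labels = to_device((inputs, labels), device)  # inputs/labels assumed defined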


Error Example

RuntimeError: CUDA out of memory

Cause

  • Batch size too large.

  • Model too large.

  • Memory fragmentation.

Solution

  1. Reduce batch size.

  2. Clear the cached allocator memory (this releases cached blocks back to the GPU but cannot reclaim memory held by live tensors):

    torch.cuda.empty_cache()

  3. Use mixed precision (AMP); a fuller sketch appears at the end of this subsection:

    with torch.cuda.amp.autocast():
        outputs = model(inputs)   # forward pass runs in reduced precision

Prevention

  • Monitor GPU using nvidia-smi.
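
A minimal mixed-precision training-loop sketch expanding on solution 3 above, assuming model, criterion, optimizer, and loader are already defined:

import torch

scaler = torch.cuda.amp.GradScaler()      # scales the loss to avoid fp16 underflow
for inputs, labels in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # forward pass in mixed precision
        loss = criterion(model(inputs), labels)
    scaler.scale(loss).backward()         # backward on the scaled loss
    scaler.step(optimizer)                # unscales gradients, then steps
    scaler.update()                       # adjusts the scale factor for the next step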


2.3 Autograd and Gradient Errors

Error Example

RuntimeError: Trying to backward through the graph a second time

Cause

  • Reusing the computation graph unintentionally.

Solution

Either keep the graph alive explicitly:

loss.backward(retain_graph=True)

or ensure .backward() is called only once per computation graph.
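
A minimal reproduction and fix:

import torch

x = torch.randn(3, requires_grad=True)
loss = (x ** 2).sum()
loss.backward(retain_graph=True)  # keep the graph for a second pass
loss.backward()                   # without retain_graph above, this raises the error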


Error Example

RuntimeError: element 0 of tensors does not require grad

Cause

  • The tensor was detached with .detach() or computed inside a with torch.no_grad(): block.

  • No tensor in the computation has requires_grad=True.

Solution

Remove the detach call (or the no_grad context) from any tensor that must carry gradients:

x = x.detach()   # delete this line if gradients are needed downstream
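
If the root tensor simply never required gradients, enable tracking explicitly; a minimal sketch:

import torch

x = torch.randn(3)            # requires_grad defaults to False
x.requires_grad_(True)        # enable gradient tracking in place
loss = (x * 2).sum()
loss.backward()               # now succeeds; x.grad is populated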

2.4 DataLoader Errors

Error Example

RuntimeError: stack expects each tensor to be equal size

Cause

  • Images or sequences have different sizes.

  • Incorrect padding.

Solution

  • Resize data using transforms:

    transforms.Resize((224, 224))
    
  • Provide a custom collate function for variable-sized inputs, as sketched below.
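
One possible collate function for variable-length sequences, assuming each dataset item is a (sequence, label) pair:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    seqs, labels = zip(*batch)                     # unzip the (sequence, label) pairs
    padded = pad_sequence(seqs, batch_first=True)  # pad to the longest sequence in the batch
    return padded, torch.tensor(labels)

loader = DataLoader(dataset, batch_size=32, collate_fn=pad_collate)  # dataset assumed defined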


Error Example

ValueError: num_workers > 0 but torch.multiprocessing.start_process not supported

Cause

  • On Windows (and often inside Jupyter notebooks), DataLoader workers are started with the spawn method, which re-imports the main module and fails if the entry point is unguarded.

Solution

Set:

num_workers=0
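
Alternatively, keep multiple workers but guard the entry point so spawned worker processes can safely re-import the module; a minimal sketch:

from torch.utils.data import DataLoader

if __name__ == "__main__":                        # required on Windows when num_workers > 0
    loader = DataLoader(dataset, num_workers=4)   # dataset assumed defined
    for batch in loader:
        pass                                      # training step goes here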

2.5 Model Not Learning / Loss Not Decreasing

Cause

  • Learning rate too high or too low.

  • Optimizer hyperparameters poorly matched to the model.

  • Mismatch between the final activation and the loss function.

  • Incorrect or inconsistent input normalization.

Solutions

  1. Lower the learning rate, for example:

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    
  2. Check final activation and loss compatibility (e.g., raw logits with CrossEntropyLoss; see the sketch after this list).

  3. Verify preprocessing of training vs validation data is consistent.
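
A minimal sketch of the logits-plus-CrossEntropyLoss pairing, assuming model, inputs, and labels are defined:

import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # applies log-softmax internally, so it expects raw logits
logits = model(inputs)              # the final layer should NOT apply softmax
loss = criterion(logits, labels)    # labels are class indices of shape (batch,)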


2.6 Saving and Loading Errors

Error Example

RuntimeError: Error(s) in loading state_dict

Cause

  • Model architecture mismatch.

  • Missing layer names.

  • Loading GPU model on CPU.

Solution

Load with a device remap, and pass strict=False only when partial loading is intentional (it silently skips mismatched keys):

state = torch.load("model.pth", map_location="cpu")
model.load_state_dict(state, strict=False)
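
The safest round trip is to save only the state_dict rather than the whole model object; a minimal sketch, where MyModel is a hypothetical architecture class:

import torch

torch.save(model.state_dict(), "model.pth")           # persist weights only

model = MyModel()                                     # rebuild the identical architecture
state = torch.load("model.pth", map_location="cpu")   # remap GPU tensors to CPU
model.load_state_dict(state)                          # strict=True (default) flags mismatches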

3. Debugging Techniques in PyTorch

3.1 Print Intermediate Shapes

def debug_forward(self, x):          # inside your nn.Module subclass
    print("Input:", x.shape)
    x = self.conv1(x)
    print("After conv1:", x.shape)
    return x

3.2 Use torchviz to Visualize Computational Graph

from torchviz import make_dot

dot = make_dot(loss, params=dict(model.named_parameters()))
dot.render("graph", format="png")   # writes graph.png (requires the graphviz package)

3.3 Use the Python Debugger (pdb)

import pdb; pdb.set_trace()   # execution pauses here; inspect tensors interactively

3.4 Gradient Checking

Check for NaN:

torch.isnan(tensor).any()
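
To check every parameter's gradient after backward(), a minimal sketch assuming model is defined:

for name, p in model.named_parameters():
    if p.grad is not None and torch.isnan(p.grad).any():
        print(f"NaN gradient in {name}")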

3.5 TensorBoard for Debugging

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()                 # logs to ./runs/ by default
writer.add_graph(model, sample_input)    # sample_input: one example batch

4. Best Practices to Avoid Errors

✔ Always control the device:

model.to(device)

✔ Use clear training templates

✔ Start with a small model and a small data subset (e.g., try to overfit a single batch; see the sketch below)

✔ Log shapes and losses frequently

✔ Validate data before training

✔ Use virtual environments to avoid dependency issues
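
A quick sanity check for the small-subset practice: a healthy pipeline should drive the loss near zero on a single fixed batch (model, criterion, optimizer, and loader assumed defined):

inputs, labels = next(iter(loader))       # one fixed batch
for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
print("final loss:", loss.item())         # should approach zero if the pipeline is sound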


5. Quick Troubleshooting Table

Error Type         | Cause                    | Fix
-------------------|--------------------------|-----------------------------------------------
Shape mismatch     | Wrong layer dimensions   | Print shapes; adjust Linear input size
CUDA device error  | CPU–GPU tensor mixing    | Use .to(device) uniformly
Out of memory      | Batch/model too large    | Reduce batch size; use AMP
Autograd error     | Multiple backward calls  | Use retain_graph=True or backward only once
DataLoader error   | Variable-sized data      | Resize inputs or use a custom collate_fn
state_dict error   | Architecture mismatch    | Match architectures; strict=False if intended

6. Conclusion

This annexure serves as a practical reference for solving and preventing PyTorch’s most commonly encountered issues. Debugging becomes far easier when you systematically examine tensor shapes, data flow, device placement, gradients, and model architecture.

By integrating these troubleshooting strategies into your workflow, you will significantly reduce debugging time and improve development efficiency.


