Chapter 13: Reinforcement Learning with PyTorch

Abstract:

Reinforcement Learning (RL) with PyTorch involves leveraging PyTorch's capabilities to build and train agents that learn to make optimal decisions in an environment through trial and error. This process typically involves the following key components and steps:
1. Environment Interaction:
  • An agent interacts with an environment, observing its state and taking actions.
  • The environment, in response, provides a new state and a reward signal, indicating the quality of the action.
  • Popular environments for RL are often provided by libraries like OpenAI Gym or specific simulators like VMAS for multi-agent scenarios.
2. Agent Design with PyTorch:
  • Policy Network: 
    A neural network, often implemented using torch.nn.Module, that takes the current state as input and outputs a probability distribution over possible actions (for policy-based methods like PPO or REINFORCE) or Q-values for each action (for value-based methods like DQN).
  • Value Network (Optional): 
    Another neural network that estimates the value of a given state or state-action pair, used in actor-critic methods to guide policy updates.
  • Optimization: 
    PyTorch's optimizers (e.g., torch.optim.Adam, torch.optim.SGD) are used to update the network weights based on calculated losses.
3. Data Collection and Replay Buffers:
  • As the agent interacts with the environment, experiences (state, action, reward, next state, done) are collected.
  • For off-policy algorithms like DQN, these experiences are often stored in a replay buffer, allowing the agent to learn from past interactions and break correlations in the data.
4. Loss Calculation and Policy Updates:
  • Value-Based Methods (e.g., DQN): 
    The loss is typically calculated by comparing the predicted Q-values with target Q-values (often using a separate target network for stability).
  • Policy-Based Methods (e.g., REINFORCE, PPO): 
    The loss is derived from the policy gradient, aiming to increase the probability of actions that lead to higher rewards. TorchRL provides convenient loss modules for algorithms like PPO, simplifying implementation.
  • Actor-Critic Methods: 
    Combine elements of both, with separate losses for the policy (actor) and value function (critic).
5. Training Loop:
  • The agent repeatedly interacts with the environment, collects data, calculates losses, and updates its policy network (and potentially value network) through backpropagation and optimization.
  • This iterative process drives the agent to learn an optimal policy that maximizes cumulative reward over time.
PyTorch's Advantages for RL:
  • Dynamic Computation Graph: Facilitates flexible network architectures and debugging.
  • GPU Acceleration: Speeds up training of deep neural networks, crucial for complex RL tasks.
  • Extensive Ecosystem: Integration with libraries like TorchRL for pre-built components and utilities for common RL algorithms, environments, and data handling (e.g., tensordict).
  • Ease of Use: Python-first design and clear API make it approachable for researchers and developers.



Chapter 13: Reinforcement Learning with PyTorch


Learning Objectives

After completing this chapter, you will be able to:

  • Understand the fundamental principles of Reinforcement Learning (RL).

  • Differentiate between policy-based and value-based methods.

  • Explain the working of Deep Q-Networks (DQN) and Policy Gradient methods.

  • Implement a simple RL agent using PyTorch.

  • Evaluate and improve RL agents through experience replay and target networks.


13.1 Reinforcement Learning Fundamentals

13.1.1 What is Reinforcement Learning?

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards.

It mimics human and animal learning — learning by trial and error.

Key Components of RL:

  1. Agent – The learner or decision maker.

  2. Environment – The world the agent interacts with.

  3. State (s) – A representation of the environment at a specific time.

  4. Action (a) – A move or decision taken by the agent.

  5. Reward (r) – Feedback from the environment.

  7. Policy (π) – The strategy used by the agent to choose actions.

  7. Value Function (V) – The expected cumulative reward for being in a state.

  8. Q-function (Q(s, a)) – The expected cumulative reward for taking action a in state s.


13.1.2 The RL Process

At each time step t:

  1. The agent observes a state sₜ.

  2. It selects an action aₜ based on its policy π(aₜ|sₜ).

  3. The environment returns a reward rₜ and a new state sₜ₊₁.

  4. The agent updates its policy based on the reward signal.

This loop continues until the episode ends (e.g., game over).
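
To make this loop concrete, the short sketch below runs one episode in OpenAI Gym (assuming Gym 0.26 or later) with a random policy standing in for a learned agent:

import gym

# Minimal agent-environment interaction loop (illustrative sketch).
env = gym.make('CartPole-v1')
state, info = env.reset()                       # observe the initial state s_0
done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()          # agent picks an action (random placeholder)
    state, reward, terminated, truncated, info = env.step(action)  # environment responds
    total_reward += reward                      # accumulate the reward signal
    done = terminated or truncated              # episode ends (pole falls or time limit)

print(f"Episode finished with total reward {total_reward}")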


13.1.3 Types of RL Methods

Category     | Description                                            | Example Algorithms
Value-based  | Learns the value of states or state-action pairs.      | Q-Learning, Deep Q-Networks
Policy-based | Directly learns a policy that maps states to actions.  | REINFORCE, Actor-Critic
Model-based  | Learns a model of the environment to plan actions.     | Dyna-Q, Model Predictive Control

13.2 Policy Gradient Methods

13.2.1 Concept

In policy gradient methods, we parameterize the policy π(a|s; θ) using neural network parameters θ and directly optimize these parameters to maximize the expected reward.

The objective function is:

[
J(\theta) = \mathbb{E}_{\pi_\theta} [R]
]

where ( R ) is the total reward.

We update the parameters using gradient ascent:

[
\theta = \theta + \alpha \nabla_\theta J(\theta)
]

Here, α is the learning rate.
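
As an illustrative sketch (the layer sizes and names here are arbitrary choices, not prescribed by the method), a policy π(a|s; θ) over a discrete action space can be parameterized in PyTorch as a network whose softmax output is a probability distribution over actions:

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a state vector to a probability distribution over discrete actions."""
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        logits = self.fc2(x)
        return torch.softmax(logits, dim=-1)   # action probabilities pi(a|s; theta)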


13.2.2 REINFORCE Algorithm

The REINFORCE algorithm is a classic Monte Carlo policy gradient method.

Steps:

  1. Run the policy πθ to generate an episode.

  2. Compute the return ( G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k ).

  3. Update policy parameters using:

[
\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) G_t
]
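
The sketch below shows one way this update is commonly written in PyTorch, assuming the log-probabilities log πθ(aₜ|sₜ) and rewards of a single episode have already been collected while running a policy network such as the one sketched above. Gradient ascent is implemented as gradient descent on the negated objective:

import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE update from a single episode (illustrative sketch).

    log_probs: list of log pi(a_t|s_t) tensors saved during the rollout.
    rewards:   list of scalar rewards r_t from the same episode.
    """
    # Compute discounted returns G_t for every time step.
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Gradient ascent on J(theta) == gradient descent on -sum(log_prob * G_t).
    loss = -torch.sum(torch.stack(log_probs) * returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()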


13.2.3 Advantages and Limitations

Advantages:

  • Can learn stochastic policies.

  • Handles continuous action spaces.

  • Directly optimizes policy.

Limitations:

  • High variance in gradients.

  • Slow convergence.


13.3 Deep Q-Networks (DQN)

13.3.1 From Q-Learning to DQN

Q-Learning is a value-based RL algorithm that learns an action-value function Q(s, a) estimating the expected reward of performing action a in state s.

The Q-update rule:

[
Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]
]

However, this becomes inefficient for large or continuous state spaces.
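
For contrast, the tabular form of this update is only a few lines; the sketch below assumes a small discrete environment whose states and actions can index a NumPy array, and it is maintaining this full table that breaks down as the state space grows:

import numpy as np

# Tabular Q-learning update (illustrative; assumes discrete states and actions).
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_learning_update(s, a, r, s_next):
    """Apply the Q-update rule for one transition (s, a, r, s')."""
    td_target = r + gamma * np.max(Q[s_next])      # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])       # move Q(s, a) toward the target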

Deep Q-Networks (DQN) overcome this limitation by approximating Q(s, a) with a deep neural network.


13.3.2 DQN Architecture

  • Input: State vector.

  • Output: Estimated Q-values for each possible action.

  • Loss function:

[
L(\theta) = \left[ r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right]^2
]

where:

  • ( \theta ): parameters of the main network.

  • ( \theta^- ): parameters of the target network.
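
As a rough PyTorch sketch of this loss (assuming a main network policy_net with parameters θ and a frozen copy target_net with parameters θ⁻, both mapping a batch of states to per-action Q-values):

import torch
import torch.nn as nn

def dqn_loss(policy_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """Squared TD error between predicted and target Q-values (illustrative sketch)."""
    # Q(s, a; theta): Q-values of the actions actually taken.
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # r + gamma * max_a' Q(s', a'; theta^-): bootstrapped target from the frozen network.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * max_next_q * (~dones)
    return nn.functional.mse_loss(q_values, targets)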


13.3.3 Key DQN Techniques

  1. Experience Replay – Store experiences (s, a, r, s') in a replay buffer and sample randomly to break correlation.

  2. Target Network – A copy of the main network that is updated periodically for stable learning.

  3. Epsilon-Greedy Exploration – Choose random action with probability ε for exploration.


13.3.4 Algorithm: Deep Q-Network

  1. Initialize Q-network and target network with random weights.

  2. For each episode:

    • Initialize state s.

    • For each step:

      • Choose action a using ε-greedy policy.

      • Take action and observe reward r and next state s′.

      • Store (s, a, r, s′) in replay buffer.

      • Sample random mini-batch from buffer.

      • Compute target ( y = r + \gamma \max_{a'} Q(s', a'; \theta^-) ).

      • Update Q-network by minimizing loss ( (y - Q(s, a; \theta))^2 ).

      • Every few steps, update target network.


13.4 Implementing a Basic RL Agent (PyTorch)

13.4.1 Environment Setup

We’ll use OpenAI Gym (version 0.26 or later, whose reset() returns (observation, info) and whose step() returns five values) and PyTorch to build a DQN agent that learns to balance a pole on a cart (CartPole-v1).

import gym
import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np
from collections import deque

13.4.2 Define the Q-Network

class DQN(nn.Module):
    """Fully connected network that maps a state vector to one Q-value per action."""
    def __init__(self, state_size, action_size):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

13.4.3 Experience Replay Memory

class ReplayBuffer:
    """Fixed-size buffer of past transitions, sampled randomly to break correlations."""
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)

    def add(self, experience):
        self.memory.append(experience)

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

13.4.4 DQN Training Loop

env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

policy_net = DQN(state_size, action_size)
target_net = DQN(state_size, action_size)
target_net.load_state_dict(policy_net.state_dict())

optimizer = optim.Adam(policy_net.parameters(), lr=0.001)
memory = ReplayBuffer(50000)

gamma = 0.99
epsilon = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995
batch_size = 64
update_target_every = 10
episodes = 500

Training the Agent

for episode in range(episodes):
    state = env.reset()[0]  # reset() returns (observation, info) in Gym >= 0.26
    state = torch.tensor(state, dtype=torch.float32)
    total_reward = 0

    done = False
    while not done:
        # Epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.randrange(action_size)
        else:
            with torch.no_grad():
                q_values = policy_net(state)
                action = torch.argmax(q_values).item()

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # episode ends on termination or time-limit truncation
        next_state = torch.tensor(next_state, dtype=torch.float32)

        memory.add((state, action, reward, next_state, done))
        state = next_state
        total_reward += reward

        # Train the network
        if len(memory) > batch_size:
            batch = memory.sample(batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)

            states = torch.stack(states)
            actions = torch.tensor(actions)
            rewards = torch.tensor(rewards)
            next_states = torch.stack(next_states)
            dones = torch.tensor(dones, dtype=torch.bool)

            q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze()  # Q(s, a; theta)
            next_q_values = target_net(next_states).max(1)[0]  # max_a' Q(s', a'; theta^-)
            targets = rewards + gamma * next_q_values * (~dones)  # no bootstrapping from terminal states

            loss = nn.MSELoss()(q_values, targets.detach())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    # Update target network
    if episode % update_target_every == 0:
        target_net.load_state_dict(policy_net.state_dict())

    print(f"Episode {episode}, Total Reward: {total_reward}")

13.4.5 Observations and Performance

  • Initially, the agent performs randomly.

  • Over time, it learns to keep the pole balanced longer.

  • Average reward increases steadily.

  • Convergence speed depends on hyperparameters (γ, α, ε-decay, batch size).
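
One simple way to measure this progress is to evaluate the trained agent with a purely greedy policy (no ε-exploration). The sketch below reuses policy_net and env from the training code above:

# Evaluation sketch: run the trained policy greedily and report the average reward.
def evaluate(policy_net, env, episodes=10):
    rewards = []
    for _ in range(episodes):
        state, _ = env.reset()
        state = torch.tensor(state, dtype=torch.float32)
        done, total = False, 0.0
        while not done:
            with torch.no_grad():
                action = torch.argmax(policy_net(state)).item()   # greedy action, no epsilon
            state, reward, terminated, truncated, _ = env.step(action)
            state = torch.tensor(state, dtype=torch.float32)
            total += reward
            done = terminated or truncated
        rewards.append(total)
    return sum(rewards) / len(rewards)

print(f"Average evaluation reward: {evaluate(policy_net, env):.1f}")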


13.5 Summary

In this chapter, you learned:

  • The fundamentals of reinforcement learning — agents, rewards, and policies.

  • The distinction between policy gradient and value-based methods.

  • The concept and structure of Deep Q-Networks (DQN).

  • How to implement a basic RL agent using PyTorch.

Reinforcement learning represents a powerful paradigm for decision-making systems, powering applications like robot control, game AI (e.g., AlphaGo), and autonomous driving.


13.6 Key Terms

Term              | Description
Agent             | Learner or decision-maker in RL.
Policy            | Strategy that maps states to actions.
Reward            | Feedback from the environment.
Q-value           | Expected return for taking an action in a state.
Experience Replay | Buffer that stores past experiences for training.
Target Network    | Stabilizing copy of the main network used in DQN.

13.7 Exercises

  1. Define reinforcement learning in your own words and describe its main components.

  2. Explain the difference between policy gradient and value-based methods.

  3. Write the Q-learning update rule and describe each term.

  4. What is the purpose of experience replay and target networks in DQN?

  5. Modify the provided DQN code to use a different environment (e.g., MountainCar-v0).

  6. Implement a REINFORCE algorithm in PyTorch for a simple environment.

  7. Discuss how exploration vs. exploitation affects RL performance.
