Chapter 13: Reinforcement Learning with PyTorch
Abstract:
- An agent interacts with an environment, observing its state and taking actions.
- The environment, in response, provides a new state and a reward signal, indicating the quality of the action.
- Popular environments for RL are often provided by libraries like OpenAI Gym or specific simulators like VMAS for multi-agent scenarios.
- Policy Network: A neural network, often implemented using torch.nn.Module, that takes the current state as input and outputs a probability distribution over possible actions (for policy-based methods like PPO or REINFORCE) or Q-values for each action (for value-based methods like DQN).
- Value Network (Optional): Another neural network that estimates the value of a given state or state-action pair, used in actor-critic methods to guide policy updates.
- Optimization: PyTorch's optimizers (e.g., torch.optim.Adam, torch.optim.SGD) are used to update the network weights based on calculated losses.
- As the agent interacts with the environment, experiences (state, action, reward, next state, done) are collected.
- For off-policy algorithms like DQN, these experiences are often stored in a replay buffer, allowing the agent to learn from past interactions and break correlations in the data.
- Value-Based Methods (e.g., DQN):The loss is typically calculated by comparing the predicted Q-values with target Q-values (often using a separate target network for stability).
- Policy-Based Methods (e.g., REINFORCE, PPO):The loss is derived from the policy gradient, aiming to increase the probability of actions that lead to higher rewards. TorchRL provides convenient loss modules for algorithms like PPO, simplifying implementation.
- Actor-Critic Methods:Combine elements of both, with separate losses for the policy (actor) and value function (critic).
- The agent repeatedly interacts with the environment, collects data, calculates losses, and updates its policy network (and potentially value network) through backpropagation and optimization.
- This iterative process drives the agent to learn an optimal policy that maximizes cumulative reward over time.
- Dynamic Computation Graph: Facilitates flexible network architectures and debugging.
- GPU Acceleration: Speeds up training of deep neural networks, crucial for complex RL tasks.
- Extensive Ecosystem: Integration with libraries like TorchRL for pre-built components and utilities for common RL algorithms, environments, and data handling (e.g., tensordict).
- Ease of Use: Python-first design and clear API make it approachable for researchers and developers.
Learning Objectives
After completing this chapter, you will be able to:
- Understand the fundamental principles of Reinforcement Learning (RL).
- Differentiate between policy-based and value-based methods.
- Explain the working of Deep Q-Networks (DQN) and Policy Gradient methods.
- Implement a simple RL agent using PyTorch.
- Evaluate and improve RL agents through experience replay and target networks.
13.1 Reinforcement Learning Fundamentals
13.1.1 What is Reinforcement Learning?
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards.
It mimics human and animal learning — learning by trial and error.
Key Components of RL:
- Agent – The learner or decision maker.
- Environment – The world the agent interacts with.
- State (s) – A representation of the environment at a specific time.
- Action (a) – A move or decision taken by the agent.
- Reward (r) – Feedback from the environment.
- Policy (π) – The strategy used by the agent to choose actions.
- Value Function (V) – The expected cumulative reward for being in a state.
- Q-function (Q(s, a)) – The expected cumulative reward for taking action a in state s.
13.1.2 The RL Process
At each time step t:
1. The agent observes a state sₜ.
2. It selects an action aₜ based on its policy π(aₜ|sₜ).
3. The environment returns a reward rₜ and a new state sₜ₊₁.
4. The agent updates its policy based on the reward signal.
This loop continues until the episode ends (e.g., game over).
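To make this loop concrete, here is a minimal sketch of one episode of agent-environment interaction, using a random policy as a stand-in for π. It assumes gym version 0.26 or later (or Gymnasium), whose step() returns separate terminated and truncated flags:

```python
import gym

env = gym.make("CartPole-v1")
state, info = env.reset()

done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # random placeholder policy
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated      # episode ends on either signal
    total_reward += reward

print(f"Episode finished with total reward {total_reward}")
env.close()
```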
13.1.3 Types of RL Methods
| Category | Description | Example Algorithms |
|---|---|---|
| Value-based | Learns the value of states or state-action pairs. | Q-Learning, Deep Q-Networks |
| Policy-based | Directly learns a policy that maps states to actions. | REINFORCE, Actor-Critic |
| Model-based | Learns a model of the environment to plan actions. | Dyna-Q, Model Predictive Control |
13.2 Policy Gradient Methods
13.2.1 Concept
In policy gradient methods, we parameterize the policy π(a|s; θ) using neural network parameters θ and directly optimize these parameters to maximize the expected reward.
The objective function is:
[
J(\theta) = \mathbb{E}_{\pi_\theta}[R]
]
where ( R ) is the total reward.
We update the parameters using gradient ascent:
[
\theta = \theta + \alpha \nabla_\theta J(\theta)
]
Here, α is the learning rate.
13.2.2 REINFORCE Algorithm
The REINFORCE algorithm is a classic Monte Carlo policy gradient method.
Steps:
1. Run the policy ( \pi_\theta ) to generate an episode.
2. Compute the return ( G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k ).
3. Update the policy parameters using:
[
\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) G_t
]
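In PyTorch, this update is usually implemented by minimizing the negative of the log-probability-weighted return and letting the optimizer take the gradient step. Below is a minimal per-episode sketch; the optimizer (over the policy network's parameters) and the lists log_probs and rewards, collected while running one episode with the current policy, are assumed to exist, and the return normalization is a common variance-reduction trick rather than part of the basic derivation:

```python
import torch

gamma = 0.99

# Compute the discounted return G_t for every time step, working backwards.
returns = []
G = 0.0
for r in reversed(rewards):          # rewards: list of floats from the episode
    G = r + gamma * G
    returns.insert(0, G)
returns = torch.tensor(returns)
returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize returns

# Maximizing sum_t log pi(a_t|s_t) * G_t is equivalent to minimizing its negative.
loss = -(torch.stack(log_probs) * returns).sum()  # log_probs: list of log pi(a_t|s_t) tensors

optimizer.zero_grad()
loss.backward()
optimizer.step()
```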
13.2.3 Advantages and Limitations
Advantages:
- Can learn stochastic policies.
- Handles continuous action spaces.
- Directly optimizes the policy.
Limitations:
- High variance in gradients.
- Slow convergence.
13.3 Deep Q-Networks (DQN)
13.3.1 From Q-Learning to DQN
Q-Learning is a value-based RL algorithm that learns an action-value function Q(s, a), an estimate of the expected cumulative reward obtained by taking action a in state s and acting optimally thereafter.
The Q-update rule:
[
Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]
]
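For a small, discrete problem this rule can be applied directly to a lookup table. The following minimal sketch illustrates a single update step; the state/action counts and the surrounding environment loop are hypothetical:

```python
import numpy as np

n_states, n_actions = 16, 4        # hypothetical small discrete environment
alpha, gamma = 0.1, 0.99           # learning rate and discount factor

Q = np.zeros((n_states, n_actions))  # tabular action-value estimates

def q_update(s, a, r, s_next):
    """Apply the Q-learning update rule for one observed transition."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```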
However, this becomes inefficient for large or continuous state spaces.
Deep Q-Networks (DQN) overcome this limitation by approximating Q(s, a) with a deep neural network.
13.3.2 DQN Architecture
- Input: State vector.
- Output: Estimated Q-values for each possible action.
- Loss function (a PyTorch sketch of this computation follows the definitions below):
[
L(\theta) = \left[ r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right]^2
]
where:
- ( \theta ): parameters of the main network.
- ( \theta^- ): parameters of the target network.
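A minimal sketch of this loss for a batch of transitions, assuming q_net and target_net are two instances of the DQN module defined in Section 13.4 and that state, action, reward, and next_state are already batched PyTorch tensors (the terminal-state mask used in the full training loop is omitted here for brevity):

```python
import torch
import torch.nn.functional as F

gamma = 0.99

# TD target uses the frozen target network (theta^-); no gradient flows through it.
with torch.no_grad():
    td_target = reward + gamma * target_net(next_state).max(dim=-1).values

# Prediction uses the main network (theta) for the actions actually taken.
q_pred = q_net(state).gather(-1, action.unsqueeze(-1)).squeeze(-1)

loss = F.mse_loss(q_pred, td_target)  # squared TD error, averaged over the batch
```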
13.3.3 Key DQN Techniques
- Experience Replay – Store experiences (s, a, r, s′) in a replay buffer and sample randomly to break correlation.
- Target Network – A copy of the main network that is updated periodically for stable learning.
- Epsilon-Greedy Exploration – Choose a random action with probability ε for exploration.
13.3.4 Algorithm: Deep Q-Network
1. Initialize Q-network and target network with random weights.
2. For each episode:
   1. Initialize state s.
   2. For each step:
      1. Choose action a using the ε-greedy policy.
      2. Take the action and observe reward r and next state s′.
      3. Store (s, a, r, s′) in the replay buffer.
      4. Sample a random mini-batch from the buffer.
      5. Compute the target ( y = r + \gamma \max_{a'} Q(s', a'; \theta^-) ).
      6. Update the Q-network by minimizing the loss ( (y - Q(s, a; \theta))^2 ).
      7. Every few steps, update the target network.
13.4 Implementing a Basic RL Agent (PyTorch)
13.4.1 Environment Setup
We’ll use OpenAI Gym (version 0.26 or later, whose reset/step API matches Gymnasium) and PyTorch to build a DQN agent that learns to balance a pole on a cart (CartPole-v1).
```python
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np
from collections import deque
```
13.4.2 Define the Q-Network
```python
class DQN(nn.Module):
    """Fully connected Q-network: maps a state to one Q-value per action."""

    def __init__(self, state_size, action_size):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
```
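As a quick sanity check (CartPole-v1 has a 4-dimensional state and 2 discrete actions), the network maps a batch of states to one Q-value per action:

```python
net = DQN(state_size=4, action_size=2)
dummy_states = torch.zeros(8, 4)    # a batch of 8 CartPole-like states
print(net(dummy_states).shape)      # torch.Size([8, 2]) -- one Q-value per action
```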
13.4.3 Experience Replay Memory
```python
class ReplayBuffer:
    """Fixed-size buffer of past experiences; oldest entries are discarded first."""

    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)

    def add(self, experience):
        self.memory.append(experience)

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```
13.4.4 DQN Training Loop
```python
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

policy_net = DQN(state_size, action_size)
target_net = DQN(state_size, action_size)
target_net.load_state_dict(policy_net.state_dict())  # start with identical weights

optimizer = optim.Adam(policy_net.parameters(), lr=0.001)
memory = ReplayBuffer(50000)

gamma = 0.99               # discount factor
epsilon = 1.0              # initial exploration rate
epsilon_min = 0.01
epsilon_decay = 0.995
batch_size = 64
update_target_every = 10   # episodes between target-network updates
episodes = 500
```
Training the Agent
```python
for episode in range(episodes):
    state = env.reset()[0]
    state = torch.tensor(state, dtype=torch.float32)
    total_reward = 0
    done = False

    while not done:
        # Epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.randrange(action_size)
        else:
            with torch.no_grad():
                q_values = policy_net(state)
            action = torch.argmax(q_values).item()

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_state = torch.tensor(next_state, dtype=torch.float32)

        memory.add((state, action, reward, next_state, done))
        state = next_state
        total_reward += reward

        # Train the network once enough experience has been collected
        if len(memory) > batch_size:
            batch = memory.sample(batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)

            states = torch.stack(states)
            actions = torch.tensor(actions)
            rewards = torch.tensor(rewards, dtype=torch.float32)
            next_states = torch.stack(next_states)
            dones = torch.tensor(dones, dtype=torch.bool)

            # Q(s, a; theta) for the actions actually taken
            q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            # max_a' Q(s', a'; theta^-) from the target network
            next_q_values = target_net(next_states).max(1)[0]
            # TD targets; terminal states contribute only the immediate reward
            targets = rewards + gamma * next_q_values * (~dones)

            loss = nn.MSELoss()(q_values, targets.detach())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Decay exploration after each episode
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    # Periodically sync the target network with the policy network
    if episode % update_target_every == 0:
        target_net.load_state_dict(policy_net.state_dict())

    print(f"Episode {episode}, Total Reward: {total_reward}")
```
13.4.5 Observations and Performance
- Initially, the agent performs randomly.
- Over time, it learns to keep the pole balanced longer.
- Average reward increases steadily.
- Convergence speed depends on hyperparameters (γ, α, ε-decay, batch size).
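To verify these observations, you can run the trained policy greedily (ε = 0) for a few evaluation episodes. A minimal sketch, reusing env and policy_net from the training code above:

```python
eval_episodes = 5
for ep in range(eval_episodes):
    state = torch.tensor(env.reset()[0], dtype=torch.float32)
    done, total_reward = False, 0.0
    while not done:
        with torch.no_grad():
            action = torch.argmax(policy_net(state)).item()   # greedy action
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        state = torch.tensor(next_state, dtype=torch.float32)
        total_reward += reward
    print(f"Evaluation episode {ep}: total reward {total_reward}")
```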
13.5 Summary
In this chapter, you learned:
- The fundamentals of reinforcement learning — agents, rewards, and policies.
- The distinction between policy gradient and value-based methods.
- The concept and structure of Deep Q-Networks (DQN).
- How to implement a basic RL agent using PyTorch.
Reinforcement learning represents a powerful paradigm for decision-making systems, powering applications like robot control, game AI (e.g., AlphaGo), and autonomous driving.
13.6 Key Terms
| Term | Description |
|---|---|
| Agent | Learner or decision-maker in RL. |
| Policy | Strategy that maps states to actions. |
| Reward | Feedback from environment. |
| Q-value | Expected return for taking an action in a state. |
| Experience Replay | Buffer that stores past experiences for training. |
| Target Network | Stabilizing copy of main network used in DQN. |
13.7 Exercises
1. Define reinforcement learning in your own words and describe its main components.
2. Explain the difference between policy gradient and value-based methods.
3. Write the Q-learning update rule and describe each term.
4. What is the purpose of experience replay and target networks in DQN?
5. Modify the provided DQN code to use a different environment (e.g., MountainCar-v0).
6. Implement a REINFORCE algorithm in PyTorch for a simple environment.
7. Discuss how exploration vs. exploitation affects RL performance.