Chapter 23: Reinforcement Learning Project

Abstract:

Reinforcement learning (RL) projects involve training an agent to interact with an environment and learn optimal actions through trial and error, aiming to maximize cumulative rewards. These projects can range from simple simulations to complex real-world applications.
Beginner-Friendly Projects:
  • OpenAI Gym Environments: 
    Solving classic control problems like CartPole, MountainCar, or LunarLander using algorithms like Q-learning or Deep Q-Networks (DQNs).
  • Atari Games: 
    Training an agent to play Atari games like Pong or Breakout from pixel inputs using DQNs.
  • Custom Environments with Unity ML-Agents: 
    Creating a simple game or simulation environment and training an RL agent to perform specific tasks within it.
  • AWS DeepRacer: 
    Participating in autonomous racing simulations to train a self-driving car agent.
Intermediate to Advanced Projects:
  • Robotics: 
    Training robots to navigate mazes, perform object manipulation tasks, or learn complex movements in simulated environments such as PyBullet or MuJoCo.
  • Traffic Light Control: 
    Developing RL agents to optimize traffic flow in simulated or real-world intersections.
  • Resource Management: 
    Applying RL to optimize resource allocation in data centers, energy grids, or manufacturing processes.
  • Gaming AI: 
    Building sophisticated AI agents for complex games like Chess (using OpenSpiel) or StarCraft II (using AIArena).
  • Natural Language Processing (NLP): 
    Using RL for tasks like dialogue generation, text summarization, or question-answering in interactive environments.
  • Finance and Trading: 
    Developing RL agents to make trading decisions in simulated stock markets or for optimal portfolio management.
Key Components of an RL Project:
  • Environment: The simulated or real-world setting where the agent interacts.
  • Agent: The entity that learns and makes decisions.
  • State: The current observation of the environment.
  • Action: The decision made by the agent.
  • Reward: Feedback received by the agent after taking an action, indicating how effective that action was.
  • RL Algorithm: The method used by the agent to learn (e.g., Q-learning, SARSA, DQN, PPO, A2C).
Many of these projects can be implemented in Python using deep learning libraries such as TensorFlow or PyTorch, with OpenAI Gym providing standardized environments.



Chapter 23

Reinforcement Learning Project: Training an Agent in OpenAI Gym

Reinforcement Learning (RL) is a powerful paradigm for training agents that learn by interacting with an environment. Instead of learning from labeled datasets (as in supervised learning), RL agents learn strategies (policies) through rewards and penalties. In this chapter, we walk through a complete RL project using OpenAI Gym, covering environment setup, training pipelines, reward tuning, policy improvement, and practical implementation examples.


23.1 Introduction

Reinforcement Learning aims to train an agent to take sequences of actions that maximize cumulative reward. A complete RL project involves:

  • Understanding the environment (states, actions, transitions)

  • Designing a reward structure

  • Choosing and implementing a learning algorithm

  • Training the agent

  • Monitoring learning curves

  • Iteratively improving policy and reward design

OpenAI Gym environments provide a standardized interface for RL experimentation, making them ideal for learning and prototyping.

In this chapter, we will develop an RL agent for the classic CartPole-v1 environment using a Deep Q-Network (DQN), while also discussing policy gradient techniques conceptually.


23.2 Understanding OpenAI Gym

OpenAI Gym defines an environment through the following:

State (Observation Space)

Describes the current situation the agent observes.
Example in CartPole:

  • Cart position

  • Cart velocity

  • Pole angle

  • Pole angular velocity

Action Space

Discrete or continuous actions the agent can take.

Example:

  • 0 = Move left

  • 1 = Move right

Reward Function

Gym environments typically provide:

  • Positive reward for staying “alive” (in CartPole, +1 per time step)

  • Episode termination when failure occurs (pole falls or cart moves out of bounds)

Episode

A sequence of steps until the environment reaches a terminal state.
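
The interaction pattern is the same for every Gym environment: reset, step, and repeat until the episode ends. The sketch below is a minimal, standalone example (assuming a recent Gym or Gymnasium release, where reset() returns (observation, info) and step() returns a 5-tuple); it runs one episode of CartPole with random actions.

import gym

# CartPole-v1 terminates when the pole falls or the cart leaves the track,
# and truncates after 500 steps
env = gym.make("CartPole-v1")

state, info = env.reset(seed=0)
total_reward = 0.0
done = False

while not done:
    action = env.action_space.sample()   # random action: 0 = left, 1 = right
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward

print(f"Random policy collected {total_reward} reward")
env.close()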


23.3 Setting Up the RL Project

Below is the typical workflow:

  1. Initialize environment

  2. Define neural network policy (for DQN/actor-critic)

  3. Define replay buffer

  4. Set hyperparameters

  5. Train agent

  6. Evaluate agent performance

  7. Tune reward and improve policy


23.4 Implementing a Deep Q-Network (DQN)

DQN approximates Q-values using a neural network:

\[
Q_\theta(s, a) \approx Q(s, a)
\]

The goal is to minimize the Bellman error:

\[
L(\theta) = \left(Q_\theta(s,a) - \left[r + \gamma \max_{a'} Q'(s', a')\right]\right)^2
\]

where \(Q'\) is the target network: a periodically synchronized copy of the online network that keeps the regression target stable.
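
As a quick numeric illustration (the values below are made up for the example, not taken from a real run), one Bellman target and its squared error can be computed as follows:

# Illustrative values for a single transition (made up, not from a real run)
r = 1.0                 # immediate reward
gamma = 0.99            # discount factor
q_sa = 9.5              # Q_theta(s, a) predicted by the online network
max_next_q = 10.0       # max_a' Q'(s', a') from the target network

target = r + gamma * max_next_q   # 1.0 + 0.99 * 10.0 = 10.9
loss = (q_sa - target) ** 2       # (9.5 - 10.9)^2 = 1.96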


23.5 Building the Agent in PyTorch

23.5.1 Neural Network Model

import gym
import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np
from collections import deque

class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, output_dim)
        )
    def forward(self, x):
        return self.layers(x)
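
A quick sanity check of the model's input and output shapes (illustrative only; CartPole has 4 observation features and 2 actions):

net = DQN(input_dim=4, output_dim=2)   # CartPole: 4 observation features, 2 actions
dummy = torch.zeros(1, 4)              # batch of one observation
print(net(dummy).shape)                # torch.Size([1, 2]): one Q-value per action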

23.5.2 Replay Buffer

class ReplayBuffer:
    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

23.6 Training Loop

23.6.1 Hyperparameters

# CartPole-v1 with the new-style Gym API: reset() returns (obs, info),
# step() returns (obs, reward, terminated, truncated, info)
env = gym.make("CartPole-v1", render_mode=None)

gamma = 0.99            # discount factor
lr = 1e-3               # learning rate for the Adam optimizer
batch_size = 64         # minibatch size drawn from the replay buffer
episodes = 500          # number of training episodes
epsilon = 1.0           # initial exploration rate for ε-greedy action selection
epsilon_decay = 0.995   # multiplicative decay applied after each episode
epsilon_min = 0.01      # exploration floor

23.6.2 Initialize Networks

state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

online_net = DQN(state_dim, action_dim)
target_net = DQN(state_dim, action_dim)
target_net.load_state_dict(online_net.state_dict())

optimizer = optim.Adam(online_net.parameters(), lr=lr)
buffer = ReplayBuffer()

23.6.3 Training the Agent

for episode in range(episodes):
    state, _ = env.reset()
    total_reward = 0

    for step in range(2000):  # Max steps per episode
        # ε-greedy action selection
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                q_values = online_net(torch.FloatTensor(state))
                action = q_values.argmax().item()

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        buffer.push(state, action, reward, next_state, done)

        state = next_state
        total_reward += reward

        # Train after buffer fills
        if len(buffer) > batch_size:
            states, actions, rewards, next_states, dones = buffer.sample(batch_size)

            states = torch.FloatTensor(states)
            actions = torch.LongTensor(actions)
            rewards = torch.FloatTensor(rewards)
            next_states = torch.FloatTensor(next_states)
            dones = torch.FloatTensor(dones)

            # Compute targets using the target network; no gradients are needed here
            with torch.no_grad():
                next_q_values = target_net(next_states).max(dim=1)[0]
                targets = rewards + gamma * next_q_values * (1 - dones)

            # Predict Q-values
            q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze()

            loss = nn.MSELoss()(q_values, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if done:
            break

    # Update ε
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    # Update target network occasionally
    if episode % 10 == 0:
        target_net.load_state_dict(online_net.state_dict())

    print(f"Episode {episode}, Reward = {total_reward}, Epsilon = {epsilon:.3f}")

23.7 Reward Optimization

Reward engineering is crucial for agent performance.

23.7.1 Reward Shaping

Modify rewards to accelerate learning:

  • Add extra reward for keeping pole more vertical

  • Penalize unnecessary movements

  • Encourage smooth trajectory

Example:

# Inside the training loop, after env.step(action) and before buffer.push(...):
angle = abs(next_state[2])          # pole angle in radians (observation index 2)

# Reward shaping: replace the default +1 with a bonus that is largest when the
# pole is vertical (0.418 rad ≈ 24° is the pole-angle bound of the observation space)
reward = 1.0 - (angle / 0.418)
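
An alternative to editing the training loop is to wrap the environment so the shaped reward is produced before the agent ever sees it. The sketch below uses gym.Wrapper; the class name AngleShapingWrapper is just for illustration.

class AngleShapingWrapper(gym.Wrapper):
    """Replaces CartPole's default +1 reward with a bonus for keeping the pole upright."""
    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        angle = abs(obs[2])                 # pole angle in radians
        shaped = 1.0 - (angle / 0.418)      # closer to vertical => reward closer to 1.0
        return obs, shaped, terminated, truncated, info

env = AngleShapingWrapper(gym.make("CartPole-v1"))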

23.7.2 Avoid Over-Engineering

Too much shaping can:

  • Make learning environment-specific

  • Lead to unintended behaviors

Always validate shaped reward in multiple scenarios.


23.8 Policy Improvement

Policy improvement refers to making the agent’s behavior progressively better.

For DQN, improvement happens when:

  • Q-network is updated to reduce Bellman error

  • ε decreases → more exploitation

  • Target net stabilizes learning

Guarantee (Policy Improvement Theorem)

If the new policy \(\pi'\) acts greedily with respect to the current action-value function \(Q^\pi\),

\[
\pi'(s) = \arg\max_a Q^\pi(s,a),
\]

then it performs at least as well as \(\pi\):

\[
Q^{\pi'}(s,a) \ge Q^{\pi}(s,a) \quad \text{for all } s, a.
\]


23.8.1 Modern Policy Improvement Techniques

1. Double DQN

Reduces overestimation bias.
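
In standard DQN, the same target network both selects and evaluates the next action, which tends to overestimate Q-values. Double DQN lets the online network select the action and the target network evaluate it. A sketch of how the target computation in the training loop above would change:

# Double DQN target: online net chooses the next action, target net scores it
with torch.no_grad():
    next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
    next_q_values = target_net(next_states).gather(1, next_actions).squeeze(1)
    targets = rewards + gamma * next_q_values * (1 - dones)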

2. Dueling Networks

Separate value and advantage streams.
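
A minimal dueling head (one possible layout, not the only one); the mean advantage is subtracted so the value/advantage decomposition stays identifiable:

class DuelingDQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU())
        self.value = nn.Linear(128, 1)               # state value V(s)
        self.advantage = nn.Linear(128, output_dim)  # advantages A(s, a)

    def forward(self, x):
        h = self.feature(x)
        v = self.value(h)
        a = self.advantage(h)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=-1, keepdim=True)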

3. Prioritized Experience Replay

Sample important transitions more frequently.
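
A toy sketch of proportional prioritization (a production implementation would use a sum-tree for efficient sampling and importance-sampling weights to correct the induced bias; both are omitted here):

class PrioritizedReplayBuffer:
    def __init__(self, capacity=50000, alpha=0.6):
        self.buffer = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)
        self.alpha = alpha    # how strongly TD error influences sampling

    def push(self, transition, td_error=1.0):
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + 1e-5) ** self.alpha)

    def sample(self, batch_size):
        probs = np.array(self.priorities)
        probs /= probs.sum()
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        return [self.buffer[i] for i in indices], indices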

4. Policy Gradient Methods

Instead of Q-values:

  • Directly optimize policy parameters

  • Examples: REINFORCE, PPO, A2C, A3C

These methods excel in continuous control (MuJoCo, robotics).
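
For contrast with the value-based DQN above, here is a bare-bones REINFORCE (Monte Carlo policy gradient) sketch for the same CartPole environment. It reuses env, gamma, state_dim, and action_dim from earlier sections and is a teaching sketch rather than a tuned implementation.

class PolicyNet(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, output_dim)
        )
    def forward(self, x):
        return torch.softmax(self.net(x), dim=-1)   # action probabilities

policy = PolicyNet(state_dim, action_dim)
pg_optimizer = optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(200):
    state, _ = env.reset()
    log_probs, step_rewards = [], []
    done = False
    while not done:
        probs = policy(torch.FloatTensor(state))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        step_rewards.append(reward)

    # Discounted returns, computed backwards from the end of the episode
    returns, G = [], 0.0
    for r in reversed(step_rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.FloatTensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction

    # REINFORCE loss: maximizing expected return = minimizing -sum(log pi * G)
    loss = -(torch.stack(log_probs) * returns).sum()
    pg_optimizer.zero_grad()
    loss.backward()
    pg_optimizer.step()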


23.9 Evaluating the RL Agent

Key metrics:

  • Episode reward curve

  • Moving average reward

  • Success rate

  • Stability across seeds

Example evaluation loop:

def evaluate(agent, env, episodes=10):
    rewards = []
    for _ in range(episodes):
        state, _ = env.reset()
        total = 0
        done = False
        while not done:
            with torch.no_grad():
                action = agent(torch.FloatTensor(state)).argmax().item()
            state, r, term, trunc, _ = env.step(action)
            done = term or trunc
            total += r
        rewards.append(total)
    return np.mean(rewards)
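
A learning curve with a moving average is the simplest way to visualize progress. The sketch below assumes the per-episode totals were collected during training into a list named episode_rewards (for example, by appending total_reward at the end of each episode); that name is an assumption, not something defined earlier in the chapter.

import matplotlib.pyplot as plt

def moving_average(values, window=20):
    # Rolling mean used to smooth the noisy per-episode rewards
    return np.convolve(values, np.ones(window) / window, mode="valid")

# episode_rewards is assumed to be a list of per-episode total rewards
plt.plot(episode_rewards, alpha=0.4, label="episode reward")
plt.plot(moving_average(episode_rewards), label="moving average (window=20)")
plt.xlabel("Episode")
plt.ylabel("Reward")
plt.legend()
plt.show()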

23.10 Final Notes and Best Practices

  • Start with simple environments

  • Tune hyperparameters carefully

  • Visualize training curves

  • Validate against unseen scenarios

  • Test model robustness (noise, delays)

  • Prefer stable algorithms (PPO, SAC) for real-world tasks


Conclusion

In this chapter, you learned:

  • How to structure a complete RL project using OpenAI Gym

  • Implementing a Deep Q-Network (DQN) from scratch in PyTorch

  • Using replay buffer, target networks, and ε-greedy exploration

  • Techniques for reward optimization and policy improvement

  • How to evaluate and iterate on an RL agent

This project lays the foundation for more advanced RL applications such as robotics, game AI, adaptive control systems, and autonomous vehicles.
