Chapter 9: Recurrent Neural Networks (RNNs) in PyTorch

Abstract:

Recurrent Neural Networks (RNNs) are a class of neural networks designed to process sequential data by maintaining a hidden state that captures information from previous inputs. PyTorch provides a convenient nn.RNN module for implementing RNNs.
Key Concepts:
  • Sequential Data Processing: 
    RNNs excel at tasks involving sequences, such as natural language processing (NLP), speech recognition, and time series prediction, where the order of data points matters.
  • Hidden State: 
    Unlike traditional feedforward networks, RNNs have a recurrent connection that feeds the hidden state from the previous time step as an input to the current time step. This allows the network to "remember" past information.
  • Unrolling Through Time: 
    An RNN can be visualized as a series of identical network units, one for each time step in the sequence. Each unit receives the current input and the hidden state from the previous unit, producing an output and an updated hidden state.
Implementing RNNs in PyTorch:
Import nn.RNN.
Python
    import torch.nn as nn
Instantiate nn.RNN.
Python
    rnn_layer = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
  • input_size: The number of features in the input at each time step.
  • hidden_size: The size of the hidden state.
  • num_layers: The number of recurrent layers.
  • batch_first=True: Specifies that the input tensor will have the batch size as its first dimension (e.g., (batch_size, sequence_length, input_size)).
  • Input Data: The input to the nn.RNN module should be a tensor with the shape (batch_size, sequence_length, input_size) (if batch_first=True).
  • Forward Pass:
Python
    output, hidden_state = rnn_layer(input_tensor, initial_hidden_state)
  • input_tensor: The input sequence.
  • initial_hidden_state: An optional initial hidden state, typically a tensor of zeros with the shape (num_layers, batch_size, hidden_size). If not provided, it defaults to zeros.
  • output: The hidden states of the last recurrent layer at every time step, with shape (batch_size, sequence_length, hidden_size) when batch_first=True.
  • hidden_state: The final hidden state of each layer, with shape (num_layers, batch_size, hidden_size).
Example (Conceptual):
Python
import torch
import torch.nn as nn

# Define parameters
input_size = 10
hidden_size = 20
num_layers = 1
sequence_length = 5
batch_size = 3

# Create a sample input tensor
input_data = torch.randn(batch_size, sequence_length, input_size)

# Initialize the RNN layer
rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)

# Forward pass
output, hidden = rnn(input_data)
print("Output shape:", output.shape)        # (batch_size, sequence_length, hidden_size)
print("Hidden state shape:", hidden.shape)  # (num_layers, batch_size, hidden_size)



Chapter 9: Recurrent Neural Networks (RNNs)


Learning Objectives

After studying this chapter, readers will be able to:

  • Understand the principles of sequential data and why Recurrent Neural Networks (RNNs) are suitable for such tasks.

  • Explain the structure and working mechanism of RNNs, LSTMs, and GRUs.

  • Implement RNN-based architectures in PyTorch for text and sequence processing.

  • Build and train a sentiment analysis model using recurrent neural networks.


9.1 Sequential Data and RNN Basics

9.1.1 Introduction to Sequential Data

In many real-world applications, data is sequential — where the order of elements matters.
Examples include:

  • Text: A sentence where word order affects meaning.

  • Speech: Audio signals changing over time.

  • Stock Prices: Sequences of values over days or hours.

  • Sensor Data: Time-series readings from IoT devices.

Traditional neural networks treat inputs as independent, ignoring temporal relationships. Recurrent Neural Networks (RNNs) overcome this by maintaining a memory of previous inputs.


9.1.2 What is a Recurrent Neural Network (RNN)?

An RNN is a type of neural network designed to process sequential data by using loops in its architecture.
It maintains a hidden state that captures information about previous time steps.

At each time step ( t ):

[
h_t = f(W_{ih}x_t + W_{hh}h_{t-1} + b_h)
]
[
y_t = W_{ho}h_t + b_o
]

Where:

  • ( x_t ): Input at time step ( t )

  • ( h_t ): Hidden state at time step ( t )

  • ( y_t ): Output at time step ( t )

  • ( W_{ih}, W_{hh}, W_{ho} ): Weight matrices

  • ( f ): Nonlinear activation (usually tanh or ReLU)
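
To make these equations concrete, the following sketch computes one recurrent update with plain PyTorch tensors and then repeats it over a short sequence; the sizes (input_size = 4, hidden_size = 3, sequence length 5) are arbitrary choices for illustration, and nn.RNN performs the same computation internally.

import torch

input_size, hidden_size = 4, 3
W_ih = torch.randn(hidden_size, input_size)   # input-to-hidden weights
W_hh = torch.randn(hidden_size, hidden_size)  # hidden-to-hidden weights
b_h = torch.zeros(hidden_size)

# One update: h_t = tanh(W_ih x_t + W_hh h_{t-1} + b_h)
x_t = torch.randn(input_size)
h_prev = torch.zeros(hidden_size)
h_t = torch.tanh(W_ih @ x_t + W_hh @ h_prev + b_h)

# "Unrolling": the same update, with the same weights, applied at every time step
sequence = torch.randn(5, input_size)
h = torch.zeros(hidden_size)
for x_t in sequence:
    h = torch.tanh(W_ih @ x_t + W_hh @ h + b_h)
print(h.shape)  # torch.Size([3])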


9.1.3 Unfolding the RNN

An RNN can be “unfolded” through time into a series of layers, each corresponding to one time step, sharing the same parameters.

Figure (conceptual):
x1 → h1 → y1
x2 → h2 → y2
x3 → h3 → y3
where each h_t depends on h_(t-1) and x_t.


9.1.4 Challenges with Vanilla RNNs

While RNNs capture sequential dependencies, they struggle with long-term dependencies due to:

  • Vanishing gradients: Gradients shrink during backpropagation through many time steps.

  • Exploding gradients: Gradients grow exponentially, destabilizing training.

To solve these, LSTM and GRU architectures were introduced.


9.2 LSTM and GRU Architectures

9.2.1 Long Short-Term Memory (LSTM)

LSTM (Long Short-Term Memory) networks are advanced RNNs that can retain information over long sequences.
They use special units called gates to regulate the flow of information.

LSTM Components:

  1. Forget Gate: Decides what information to discard.

  2. Input Gate: Decides what new information to store.

  3. Cell State: Maintains long-term memory.

  4. Output Gate: Determines what to output.

Equations:
[
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
]
[
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
]
[
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)
]
[
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t
]
[
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
]
[
h_t = o_t * \tanh(C_t)
]

Conceptual Diagram:
LSTM cell showing input gate, forget gate, cell state, and output gate interacting through the flow of data.
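
In PyTorch these gates are implemented internally by nn.LSTM, which returns the per-step outputs together with a tuple holding the final hidden and cell states. The sizes below are illustrative assumptions, not values tied to any particular dataset.

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=1, batch_first=True)
x = torch.randn(3, 5, 10)                # (batch_size, sequence_length, input_size)

output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([3, 5, 20]), hidden state at every time step
print(h_n.shape)     # torch.Size([1, 3, 20]), final hidden state per layer
print(c_n.shape)     # torch.Size([1, 3, 20]), final cell state per layer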


9.2.2 Gated Recurrent Unit (GRU)

GRU (Gated Recurrent Unit) is a simplified version of LSTM, merging the forget and input gates into a single update gate and removing the explicit cell state.

Equations:
[
z_t = \sigma(W_z [h_{t-1}, x_t])
]
[
r_t = \sigma(W_r [h_{t-1}, x_t])
]
[
\tilde{h}_t = \tanh(W [r_t * h_{t-1}, x_t])
]
[
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
]

Advantages of GRU:

  • Fewer parameters → faster training.

  • Comparable performance to LSTM on many tasks.
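
PyTorch's nn.GRU mirrors nn.LSTM, but because there is no separate cell state it returns only the final hidden state. The shapes below are again illustrative assumptions.

import torch
import torch.nn as nn

gru = nn.GRU(input_size=10, hidden_size=20, num_layers=1, batch_first=True)
x = torch.randn(3, 5, 10)      # (batch_size, sequence_length, input_size)

output, h_n = gru(x)           # no cell state, unlike nn.LSTM
print(output.shape)  # torch.Size([3, 5, 20])
print(h_n.shape)     # torch.Size([1, 3, 20])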


9.2.3 Comparison between RNN, LSTM, and GRU

| Feature | RNN | LSTM | GRU |
|---|---|---|---|
| Handles long-term dependencies | ❌ Poor | ✅ Excellent | ✅ Good |
| Computation speed | ✅ Fast | ⚠️ Slower | ✅ Faster |
| Number of gates | None | 3 (input, forget, output) | 2 (reset, update) |
| Complexity | Low | High | Moderate |
| Use cases | Short sequences | Complex time dependencies | Balanced tasks |

9.3 Text and Sequence Processing

9.3.1 Why RNNs for Text Data?

In Natural Language Processing (NLP), the meaning of a word depends on context.
For example:

  • “He is banking the fire.” (verb)

  • “He went to the bank.” (noun)

RNNs (and LSTMs/GRUs) are ideal for learning such contextual relationships because they process sequences word-by-word while maintaining a memory of earlier words.


9.3.2 Text Preprocessing Pipeline

Before feeding text to an RNN, the data must be processed:

  1. Tokenization: Split sentences into words or subwords.

  2. Vocabulary Building: Map each unique token to an integer.

  3. Encoding: Convert text to numerical sequences.

  4. Padding: Ensure all sequences have equal length.

  5. Embedding: Map tokens to dense vectors using nn.Embedding.

Example in PyTorch:

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=5000, embedding_dim=100)
input_ids = torch.tensor([[1, 23, 45, 6, 0, 0]])  # padded sequence (index 0 used as padding)
embedded = embedding(input_ids)
print(embedded.shape)  # torch.Size([1, 6, 100])
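
The embedding example assumes the text has already been tokenized, numericalized, and padded (steps 1-4 above). A minimal, self-contained sketch of those steps, using a toy vocabulary built on the spot rather than a real tokenizer, might look like this:

import torch

sentences = ["this movie was great", "terrible plot"]

# 1-2. Tokenization and vocabulary building (index 0 reserved for padding)
tokens = [s.split() for s in sentences]
vocab = {"<pad>": 0}
for sent in tokens:
    for word in sent:
        vocab.setdefault(word, len(vocab))

# 3-4. Encoding and padding to a common length
max_len = max(len(sent) for sent in tokens)
encoded = [[vocab[w] for w in sent] + [0] * (max_len - len(sent)) for sent in tokens]
input_ids = torch.tensor(encoded)
print(input_ids.shape)  # torch.Size([2, 4])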

9.3.3 Building an RNN Model for Text

class RNN_TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super(RNN_TextClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.embedding(x)              # (batch, seq_len) -> (batch, seq_len, embed_dim)
        out, hidden = self.rnn(x)          # hidden: (1, batch, hidden_dim) for this single-layer RNN
        return self.fc(hidden.squeeze(0))  # classify from the final hidden state
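
A quick smoke test with randomly generated token IDs confirms the output shape; the sizes here are illustrative assumptions.

import torch

model = RNN_TextClassifier(vocab_size=5000, embed_dim=100, hidden_dim=64, output_dim=2)
dummy_batch = torch.randint(0, 5000, (8, 20))   # 8 sequences of 20 token IDs each
logits = model(dummy_batch)
print(logits.shape)  # torch.Size([8, 2])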

9.4 Sentiment Analysis Example

9.4.1 Dataset Overview

The IMDB Movie Review Dataset is commonly used for binary sentiment classification (positive/negative).
Each review is a sequence of words expressing sentiment.
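
The training loop in the next subsection iterates over a train_loader. Downloading and tokenizing the real IMDB reviews is outside the scope of this chapter, so the sketch below builds a DataLoader from random token IDs purely so the later code is runnable; in practice you would substitute your own preprocessed IMDB tensors.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder for preprocessed IMDB data: 1,000 reviews, each padded to 200 token IDs
inputs = torch.randint(0, 10000, (1000, 200))
labels = torch.randint(0, 2, (1000,))           # 0 = negative, 1 = positive

train_loader = DataLoader(TensorDataset(inputs, labels), batch_size=32, shuffle=True)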


9.4.2 LSTM for Sentiment Analysis

Step 1: Define the Model

class LSTM_Sentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim, num_layers=2):
        super(LSTM_Sentiment, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True, dropout=0.3)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.embedding(x)            # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(x)    # hidden: (num_layers, batch, hidden_dim)
        out = self.fc(hidden[-1])        # classify from the top layer's final hidden state
        return out

Step 2: Training Loop

import torch.optim as optim

model = LSTM_Sentiment(vocab_size=10000, embed_dim=100, hidden_dim=128, output_dim=2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# train_loader is assumed to yield batches of (padded token-ID sequences, labels)
for epoch in range(5):
    running_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {running_loss/len(train_loader):.4f}")

Step 3: Evaluation

def predict_sentiment(model, sentence, vocab):
    model.eval()
    tokens = [vocab[word] for word in sentence.lower().split() if word in vocab]
    input_seq = torch.tensor(tokens).unsqueeze(0)  # add batch dimension
    with torch.no_grad():
        output = model(input_seq)
    pred = torch.argmax(output, dim=1).item()
    return "Positive" if pred == 1 else "Negative"

9.5 Applications of RNNs

  • Language Modeling and Text Generation

  • Speech Recognition

  • Machine Translation

  • Music Composition

  • Stock Market Forecasting

  • Time-Series Prediction


9.6 Summary

  • RNNs are powerful models for handling sequential or time-dependent data.

  • LSTM and GRU architectures effectively mitigate the vanishing gradient problem.

  • PyTorch provides nn.RNN, nn.LSTM, and nn.GRU modules for easy implementation.

  • Sentiment analysis demonstrates how RNNs can capture the context and meaning of text.


9.7 Exercises

  1. Conceptual Questions

    • Explain how RNNs differ from feedforward networks.

    • What causes vanishing gradients in RNNs?

    • Compare LSTM and GRU in terms of architecture and performance.

    • Why is the order of words crucial in NLP tasks?

  2. Coding Exercises

    • Modify the RNN_TextClassifier to use a GRU instead of RNN.

    • Add dropout and batch normalization to improve generalization.

    • Implement a character-level text generator using LSTM.

    • Train a GRU-based model on a temperature time-series dataset.

  3. Mini Project

    • Perform sentiment analysis on Twitter data using pretrained word embeddings (e.g., GloVe or Word2Vec) and a bi-directional LSTM network.
