Chapter 9: Recurrent Neural Networks (RNNs) in PyTorch
Abstract:
This chapter introduces PyTorch's nn.RNN module for implementing RNNs.
- Sequential Data Processing: RNNs excel at tasks involving sequences, such as natural language processing (NLP), speech recognition, and time series prediction, where the order of data points matters.
- Hidden State: Unlike traditional feedforward networks, RNNs have a recurrent connection that feeds the hidden state from the previous time step as an input to the current time step. This allows the network to "remember" past information.
- Unrolling Through Time: An RNN can be visualized as a series of identical network units, one for each time step in the sequence. Each unit receives the current input and the hidden state from the previous unit, producing an output and an updated hidden state.
import torch.nn as nn

rnn_layer = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)

- input_size: The number of features in the input at each time step.
- hidden_size: The size of the hidden state.
- num_layers: The number of recurrent layers.
- batch_first=True: Specifies that the input tensor will have the batch size as its first dimension, e.g., (batch_size, sequence_length, input_size).
- Input Data: The input to the nn.RNN module should be a tensor with the shape (batch_size, sequence_length, input_size) (if batch_first=True).
- Forward Pass:

output, hidden_state = rnn_layer(input_tensor, initial_hidden_state)

- input_tensor: The input sequence.
- initial_hidden_state: An optional initial hidden state, typically a tensor of zeros with the shape (num_layers, batch_size, hidden_size). If not provided, it defaults to zeros.
- output: Contains the output features (hidden states) from all time steps.
- hidden_state: Contains the hidden state for the last time step.
import torch
import torch.nn as nn

# Define parameters
input_size = 10
hidden_size = 20
num_layers = 1
sequence_length = 5
batch_size = 3

# Create a sample input tensor
input_data = torch.randn(batch_size, sequence_length, input_size)

# Initialize the RNN layer
rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)

# Forward pass
output, hidden = rnn(input_data)

print("Output shape:", output.shape)        # (batch_size, sequence_length, hidden_size)
print("Hidden state shape:", hidden.shape)  # (num_layers, batch_size, hidden_size)
Chapter 9: Recurrent Neural Networks (RNNs)
Learning Objectives
After studying this chapter, readers will be able to:
- Understand the principles of sequential data and why Recurrent Neural Networks (RNNs) are suitable for such tasks.
- Explain the structure and working mechanism of RNNs, LSTMs, and GRUs.
- Implement RNN-based architectures in PyTorch for text and sequence processing.
- Build and train a sentiment analysis model using recurrent neural networks.
9.1 Sequential Data and RNN Basics
9.1.1 Introduction to Sequential Data
In many real-world applications, data is sequential — where the order of elements matters.
Examples include:
- Text: A sentence where word order affects meaning.
- Speech: Audio signals changing over time.
- Stock Prices: Sequences of values over days or hours.
- Sensor Data: Time-series readings from IoT devices.
Traditional neural networks treat inputs as independent, ignoring temporal relationships. Recurrent Neural Networks (RNNs) overcome this by maintaining a memory of previous inputs.
9.1.2 What is a Recurrent Neural Network (RNN)?
An RNN is a type of neural network designed to process sequential data by using loops in its architecture.
It maintains a hidden state that captures information about previous time steps.
At each time step ( t ):
[
h_t = f(W_{ih}x_t + W_{hh}h_{t-1} + b_h)
]
[
y_t = W_{ho}h_t + b_o
]
Where:
- ( x_t ): Input at time step ( t )
- ( h_t ): Hidden state at time step ( t )
- ( y_t ): Output at time step ( t )
- ( W_{ih}, W_{hh}, W_{ho} ): Weight matrices
- ( f ): Nonlinear activation (usually tanh or ReLU)
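To make the recurrence concrete, here is a minimal sketch of one forward pass through time, implementing the two equations above directly with torch tensors. The sizes and randomly initialized weight matrices are illustrative assumptions, not values from the text:

import torch

# Illustrative sizes (assumptions for this sketch)
input_size, hidden_size, output_size, seq_len = 4, 8, 2, 5

# Weight matrices corresponding to W_ih, W_hh, W_ho and biases b_h, b_o
W_ih = torch.randn(hidden_size, input_size)
W_hh = torch.randn(hidden_size, hidden_size)
W_ho = torch.randn(output_size, hidden_size)
b_h = torch.zeros(hidden_size)
b_o = torch.zeros(output_size)

x = torch.randn(seq_len, input_size)   # one sequence of inputs x_1 ... x_T
h = torch.zeros(hidden_size)           # initial hidden state h_0

for t in range(seq_len):
    # h_t = tanh(W_ih x_t + W_hh h_{t-1} + b_h)
    h = torch.tanh(W_ih @ x[t] + W_hh @ h + b_h)
    # y_t = W_ho h_t + b_o
    y = W_ho @ h + b_o
    print(f"step {t}: y_t = {y}")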
9.1.3 Unfolding the RNN
An RNN can be “unfolded” through time into a series of layers, each corresponding to one time step, sharing the same parameters.
Figure (conceptual):
x1 → h1 → y1
       ↓
x2 → h2 → y2
       ↓
x3 → h3 → y3
where each h_t depends on h_(t-1) and x_t, and every step uses the same weights.
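A compact way to see this parameter sharing in PyTorch is to drive a single nn.RNNCell (one set of weights) over every time step yourself. The sizes below are illustrative assumptions:

import torch
import torch.nn as nn

batch_size, seq_len, input_size, hidden_size = 3, 4, 10, 20  # illustrative sizes

cell = nn.RNNCell(input_size, hidden_size)        # one cell = one set of shared weights
x = torch.randn(batch_size, seq_len, input_size)  # a batch of sequences
h = torch.zeros(batch_size, hidden_size)          # h_0

for t in range(seq_len):
    h = cell(x[:, t, :], h)   # the same weights are applied at every time step
print(h.shape)                # torch.Size([3, 20]) -- final hidden state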
9.1.4 Challenges with Vanilla RNNs
While RNNs capture sequential dependencies, they struggle with long-term dependencies due to:
- Vanishing gradients: Gradients shrink during backpropagation through many time steps.
- Exploding gradients: Gradients grow exponentially, destabilizing training.
To solve these, LSTM and GRU architectures were introduced.
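Exploding gradients in particular are also commonly mitigated with gradient clipping, independent of the architecture. The snippet below is only a sketch of where torch.nn.utils.clip_grad_norm_ is typically inserted into a training step; the model, input, and dummy loss are placeholders:

import torch
import torch.nn as nn

# Placeholder model and data, just to make the snippet self-contained
model = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(3, 5, 10)

optimizer.zero_grad()
output, hidden = model(x)
loss = output.mean()          # dummy loss, just for illustration
loss.backward()

# Cap the global gradient norm (at 1.0 here) before the parameter update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()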
9.2 LSTM and GRU Architectures
9.2.1 Long Short-Term Memory (LSTM)
LSTM (Long Short-Term Memory) networks are advanced RNNs that can retain information over long sequences.
They use special units called gates to regulate the flow of information.
LSTM Components:
- Forget Gate: Decides what information to discard.
- Input Gate: Decides what new information to store.
- Cell State: Maintains long-term memory.
- Output Gate: Determines what to output.
Equations:
[
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
]
[
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
]
[
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)
]
[
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t
]
[
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
]
[
h_t = o_t * \tanh(C_t)
]
Conceptual Diagram:
LSTM cell showing input gate, forget gate, cell state, and output gate interacting through the flow of data.
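In PyTorch these gates are implemented internally by nn.LSTM. The short sketch below, with illustrative sizes, only shows the output shapes, including the extra cell state that distinguishes LSTMs from plain RNNs:

import torch
import torch.nn as nn

batch_size, seq_len, input_size, hidden_size = 3, 5, 10, 20  # illustrative sizes

lstm = nn.LSTM(input_size, hidden_size, num_layers=1, batch_first=True)
x = torch.randn(batch_size, seq_len, input_size)

# Unlike nn.RNN, nn.LSTM returns both a hidden state h_n and a cell state c_n
output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([3, 5, 20])  -- hidden states for every time step
print(h_n.shape)     # torch.Size([1, 3, 20])  -- last hidden state per layer
print(c_n.shape)     # torch.Size([1, 3, 20])  -- last cell state per layer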
9.2.2 Gated Recurrent Unit (GRU)
GRU (Gated Recurrent Unit) is a simplified version of LSTM, merging the forget and input gates into a single update gate and removing the explicit cell state.
Equations:
[
z_t = \sigma(W_z [h_{t-1}, x_t])
]
[
r_t = \sigma(W_r [h_{t-1}, x_t])
]
[
\tilde{h}_t = \tanh(W [r_t * h_{t-1}, x_t])
]
[
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
]
Advantages of GRU:
- Fewer parameters → faster training (see the parameter-count sketch below).
- Comparable performance to LSTM on many tasks.
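One way to make the "fewer parameters" claim concrete is to count the parameters of nn.RNN, nn.LSTM, and nn.GRU layers of identical size. The layer sizes below are illustrative assumptions:

import torch.nn as nn

input_size, hidden_size = 100, 128  # illustrative sizes

def count_params(module):
    return sum(p.numel() for p in module.parameters())

for name, layer in [("RNN", nn.RNN(input_size, hidden_size)),
                    ("LSTM", nn.LSTM(input_size, hidden_size)),
                    ("GRU", nn.GRU(input_size, hidden_size))]:
    print(f"{name}: {count_params(layer)} parameters")

# A GRU has roughly 3/4 as many parameters as an LSTM of the same size,
# since it uses three gate-like blocks instead of four.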
9.2.3 Comparison between RNN, LSTM, and GRU
| Feature | RNN | LSTM | GRU |
|---|---|---|---|
| Handles Long-Term Dependencies | ❌ Poor | ✅ Excellent | ✅ Good |
| Computation Speed | ✅ Fast | ⚠️ Slower | ✅ Faster than LSTM |
| Number of Gates | None | 3 (input, forget, output) | 2 (reset, update) |
| Complexity | Low | High | Moderate |
| Use Cases | Short sequences | Complex time dependencies | Balanced tasks |
9.3 Text and Sequence Processing
9.3.1 Why RNNs for Text Data?
In Natural Language Processing (NLP), the meaning of a word depends on context.
For example:
- “He is banking the fire.” (verb)
- “He went to the bank.” (noun)
RNNs (and LSTMs/GRUs) are ideal for learning such contextual relationships because they process sequences word-by-word while maintaining a memory of earlier words.
9.3.2 Text Preprocessing Pipeline
Before feeding text to an RNN, the data must be processed:
- Tokenization: Split sentences into words or subwords.
- Vocabulary Building: Map each unique token to an integer.
- Encoding: Convert text to numerical sequences.
- Padding: Ensure all sequences have equal length.
- Embedding: Map tokens to dense vectors using nn.Embedding.
Example in PyTorch:
import torch
import torch.nn as nn

# 5000-word vocabulary, 100-dimensional dense vectors
embedding = nn.Embedding(num_embeddings=5000, embedding_dim=100)
input_ids = torch.tensor([[1, 23, 45, 6, 0, 0]])  # padded sequence of token ids
embedded = embedding(input_ids)
print(embedded.shape)  # torch.Size([1, 6, 100])
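The embedding step above assumes the earlier pipeline steps have already produced padded integer sequences. A minimal sketch of those steps (tokenize, build a vocabulary, encode, pad) might look as follows; the toy corpus and the <pad>/<unk> tokens are assumptions for illustration:

import torch

corpus = ["the movie was great", "the plot was weak"]  # toy corpus (assumption)

# 1. Tokenization: split each sentence into words
tokenized = [sentence.split() for sentence in corpus]

# 2. Vocabulary building: map each unique token to an integer id
vocab = {"<pad>": 0, "<unk>": 1}
for tokens in tokenized:
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))

# 3. Encoding: convert tokens to ids (unknown words map to <unk>)
encoded = [[vocab.get(tok, vocab["<unk>"]) for tok in tokens] for tokens in tokenized]

# 4. Padding: make all sequences the same length
max_len = max(len(seq) for seq in encoded)
padded = [seq + [vocab["<pad>"]] * (max_len - len(seq)) for seq in encoded]

input_ids = torch.tensor(padded)
print(input_ids.shape)  # torch.Size([2, 4])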
9.3.3 Building an RNN Model for Text
class RNN_TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super(RNN_TextClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)        # token ids -> dense vectors
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # processes the sequence
        self.fc = nn.Linear(hidden_dim, output_dim)                 # final hidden state -> class scores

    def forward(self, x):
        x = self.embedding(x)              # (batch, seq_len) -> (batch, seq_len, embed_dim)
        out, hidden = self.rnn(x)          # hidden: (num_layers, batch, hidden_dim)
        return self.fc(hidden.squeeze(0))  # classify from the last hidden state
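A quick shape check, with illustrative hyperparameters, confirms that the classifier maps a batch of token ids to one score vector per example:

import torch

model = RNN_TextClassifier(vocab_size=5000, embed_dim=100, hidden_dim=128, output_dim=2)
dummy_batch = torch.randint(0, 5000, (4, 20))  # 4 sequences of 20 token ids each
logits = model(dummy_batch)
print(logits.shape)  # torch.Size([4, 2])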
9.4 Sentiment Analysis Example
9.4.1 Dataset Overview
The IMDB Movie Review Dataset is commonly used for binary sentiment classification (positive/negative).
Each review is a sequence of words expressing sentiment.
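The training loop in the next section assumes a train_loader that yields batches of padded token-id tensors and labels. One way to build such a loader is sketched below; the encoded_reviews and labels lists are placeholders standing in for the preprocessed IMDB data:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Placeholder data: each review is already encoded as a list of token ids
encoded_reviews = [[12, 5, 78, 3], [4, 99, 23], [7, 8, 9, 10, 11]]
labels = [1, 0, 1]  # 1 = positive, 0 = negative

dataset = [(torch.tensor(r), torch.tensor(l)) for r, l in zip(encoded_reviews, labels)]

def collate_batch(batch):
    sequences, targets = zip(*batch)
    # Pad every sequence in the batch to the length of the longest one
    padded = pad_sequence(sequences, batch_first=True, padding_value=0)
    return padded, torch.stack(targets)

train_loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_batch)

for inputs, targets in train_loader:
    print(inputs.shape, targets.shape)  # e.g. torch.Size([2, 5]) torch.Size([2])
    break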
9.4.2 LSTM for Sentiment Analysis
Step 1: Define the Model
class LSTM_Sentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim, num_layers=2):
        super(LSTM_Sentiment, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Stacked LSTM with dropout between layers (dropout applies only when num_layers > 1)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True, dropout=0.3)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.embedding(x)          # (batch, seq_len) -> (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(x)  # hidden: (num_layers, batch, hidden_dim)
        out = self.fc(hidden[-1])      # classify from the last layer's final hidden state
        return out
Step 2: Training Loop
import torch.optim as optim

model = LSTM_Sentiment(vocab_size=10000, embed_dim=100, hidden_dim=128, output_dim=2)
criterion = nn.CrossEntropyLoss()                     # suitable for two-class logits
optimizer = optim.Adam(model.parameters(), lr=0.001)

model.train()
for epoch in range(5):
    running_loss = 0.0
    for inputs, labels in train_loader:               # train_loader yields (padded ids, labels)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {running_loss/len(train_loader):.4f}")
Step 3: Evaluation
def predict_sentiment(model, sentence, vocab):
    model.eval()                                   # switch off dropout for inference
    tokens = [vocab[word] for word in sentence.lower().split() if word in vocab]
    input_seq = torch.tensor(tokens).unsqueeze(0)  # add a batch dimension
    with torch.no_grad():
        output = model(input_seq)
    pred = torch.argmax(output, dim=1).item()
    return "Positive" if pred == 1 else "Negative"
9.5 Applications of RNNs
- Language Modeling and Text Generation
- Speech Recognition
- Machine Translation
- Music Composition
- Stock Market Forecasting
- Time-Series Prediction
9.6 Summary
- RNNs are powerful models for handling sequential or time-dependent data.
- LSTM and GRU architectures effectively mitigate the vanishing gradient problem.
- PyTorch provides nn.RNN, nn.LSTM, and nn.GRU modules for easy implementation.
- Sentiment analysis demonstrates how RNNs can capture the context and meaning of text.
9.7 Exercises
1. Conceptual Questions
   - Explain how RNNs differ from feedforward networks.
   - What causes vanishing gradients in RNNs?
   - Compare LSTM and GRU in terms of architecture and performance.
   - Why is the order of words crucial in NLP tasks?
2. Coding Exercises
   - Modify the RNN_TextClassifier to use a GRU instead of an RNN.
   - Add dropout and batch normalization to improve generalization.
   - Implement a character-level text generator using an LSTM.
   - Train a GRU-based model on a temperature time-series dataset.
3. Mini Project
   - Perform sentiment analysis on Twitter data using pretrained word embeddings (e.g., GloVe or Word2Vec) and a bidirectional LSTM network.