RNNs, GRUs, and CTC (MyTorch ep.3)

CMU 11-785 Course: Deep Learning

In the MyTorch series (ep.1-3), I implemented my own deep learning library, a spinoff of PyTorch, from scratch: MLPs, CNNs, RNNs, GRUs, and LSTMs. In this assignment (ep.3), I implemented RNNs, GRUs, and CTC.

Due to academic integrity policies, I will only go over the broader concepts covered in this assignment, not the actual code. To demonstrate RNNs, GRUs, and LSTMs in practice, I will walk through a binary sentiment prediction task using the TensorFlow IMDB dataset.

Recurrent Neural Network (RNN)

An RNN is a type of neural network designed to process sequential data, such as speech or text. RNNs have a feedback loop that allows information to persist from one timestep to the next, which lets them capture temporal dependencies in the data. In my assignment, I implemented the Elman RNN.

Elman RNN vs Jordan RNN

In an Elman RNN, the hidden state from the previous timestep is fed back as an input to the current timestep. The output at the current timestep is computed from the current input and the updated hidden state.

In a Jordan RNN, the output of the previous timestep is fed back as an input to the current timestep, so the hidden state at the current timestep is computed from the current input and the previous output. The Jordan architecture is less commonly used than the Elman one.
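To make the Elman recurrence concrete, here is a minimal NumPy sketch of the hidden-state update in its standard textbook form (this is not my assignment code; the weight names and the tanh activation follow the usual convention, with W_ih mapping the input to the hidden state and W_hh mapping the previous hidden state to the new one):

import numpy as np

def elman_rnn_cell(x_t, h_prev, W_ih, W_hh, b_ih, b_hh):
    # one timestep: the new hidden state mixes the current input and the previous hidden state
    return np.tanh(x_t @ W_ih.T + b_ih + h_prev @ W_hh.T + b_hh)

def elman_rnn_forward(x_seq, h0, W_ih, W_hh, b_ih, b_hh):
    # unroll over the sequence, carrying the hidden state from step to step
    h, hidden_states = h0, []
    for x_t in x_seq:                       # x_seq has shape (seq_len, input_size)
        h = elman_rnn_cell(x_t, h, W_ih, W_hh, b_ih, b_hh)
        hidden_states.append(h)
    return np.stack(hidden_states), h       # all hidden states and the final one

An output layer (for example a linear map of the hidden state followed by a softmax) is then applied at each timestep, or only at the final one, depending on the task.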

Vanishing Gradient Problem —

The vanishing gradient problem is a common issue that arises during the training of RNNs. It occurs when the gradient of the error function with respect to the weights in the earlier layers of the network (or, for an unrolled RNN, the earlier timesteps) becomes very small, effectively "vanishing" as it is backpropagated through the layers. As a result, those weights are barely updated, and the network may fail to learn long-term dependencies.
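To see concretely where the gradient shrinks, consider an unrolled RNN with hidden state $h_t = \tanh(W_{ih} x_t + W_{hh} h_{t-1})$ (biases omitted). Backpropagating a loss $\mathcal{L}$ from timestep $T$ back to timestep $t$ chains the per-step Jacobians:

$$
\frac{\partial \mathcal{L}}{\partial h_t} \;=\; \frac{\partial \mathcal{L}}{\partial h_T}\prod_{k=t+1}^{T}\frac{\partial h_k}{\partial h_{k-1}},
\qquad
\frac{\partial h_k}{\partial h_{k-1}} \;=\; \operatorname{diag}\!\left(1 - h_k^{2}\right) W_{hh}
$$

If the norms of these Jacobians are consistently below 1 (small recurrent weights, saturated tanh units), the product shrinks exponentially as the gap $T - t$ grows; if they are consistently above 1, the gradient explodes instead.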

The vanishing gradient problem can be addressed in several ways, such as using activation functions that can preserve the gradient, initializing the weights appropriately, and using architectural modifications such as gating mechanisms, as in the case of gated recurrent units (GRUs) and long short-term memory (LSTM) networks.

Gated Recurrent Unit (GRU)

This is a type of RNN cell architecture that addresses the vanishing gradient problem. GRUs have a gating mechanism similar to LSTMs, but with fewer parameters. GRUs have two gates: the reset gate and the update gate. The reset gate determines how much of the past information should be forgotten when forming the candidate state, while the update gate determines how much of the previous hidden state is carried over versus replaced by the new candidate state.
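For reference, here are the standard GRU equations, written in the convention used by PyTorch (some references swap the roles of $z_t$ and $1 - z_t$ in the final interpolation), with sigmoid gates $\sigma$ and elementwise product $\odot$:

$$
\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)}\\
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)}\\
\tilde{h}_t &= \tanh\!\left(W_h x_t + U_h \,(r_t \odot h_{t-1}) + b_h\right) && \text{(candidate state)}\\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
$$

The reset gate $r_t$ controls how much of the previous hidden state enters the candidate $\tilde{h}_t$, while the update gate $z_t$ interpolates between keeping the old state and adopting the candidate.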

Long Short-Term Memory (LSTM)

This is another type of RNN cell architecture that addresses the vanishing gradient problem. LSTMs use a gating mechanism that allows the network to selectively store or discard information at each timestep. The gating mechanism consists of three gates: the input gate, output gate, and forget gate. These gates control the flow of information through the network and allow it to maintain a long-term memory. LSTMs have more parameters than GRUs, but are also more expressive and capable of modeling more complex relationships in the data.
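And the corresponding textbook LSTM equations, with a separate cell state $c_t$ alongside the hidden state $h_t$:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(new cell state)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(new hidden state)}
\end{aligned}
$$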

RNN vs GRU vs LSTM —

Architecture

  • RNN has the simplest architecture: a single hidden state updated by the same set of weights at every timestep.

  • GRU has a more complex architecture than RNN, with two gates.

  • LSTM has a more complex architecture with three gates.

Gates

  • RNN does not have any gates.

  • GRU has two gates: reset gate and update gate.

  • LSTM has three gates: forget gate, input gate, and output gate.

Learning

  • RNN is prone to vanishing and exploding gradient problems.

  • GRU is designed to address the vanishing gradient problem of RNN.

  • LSTM is also designed to address the vanishing gradient problem of RNN.

Memory

  • RNN has a short-term memory capacity.

  • GRU can retain information over longer spans than a vanilla RNN thanks to its gates, though it has no separate cell state.

  • LSTM maintains a separate cell state alongside the hidden state, giving it the strongest long-term memory of the three.

Training

  • RNNs can take longer to train than non-recurrent networks because the data must be processed sequentially, one timestep at a time.

  • GRUs often converge in fewer epochs than vanilla RNNs because the gating mechanism stabilizes gradient flow, even though each timestep is more expensive to compute.

  • LSTMs behave similarly, though they are the most expensive per timestep since they have the most parameters.

Use Cases

  • RNNs are well-suited for time-series data, speech recognition, language modeling, and NLP.

  • GRUs are commonly used for similar applications as RNNs, but are more effective at capturing long-term dependencies, making them well suited for machine translation, speech recognition, and image captioning.

  • LSTMs are commonly used for speech recognition, machine translation, and image captioning.

Connectionist Temporal Classification (CTC)

This is a method for training sequence-to-sequence models, such as speech recognition models, when the alignment between the input and output sequences is unknown. It is often used in tasks such as speech recognition and handwriting recognition, where the input is a sequence of M acoustic or visual feature frames and the output is a sequence of N labels or characters, with M ≥ N.

The main idea behind CTC is to allow the neural network to make predictions without knowing the exact alignment between the input and output sequences. This is achieved by introducing a special "blank" label, which can be inserted between the actual labels in the output sequence. The network is then trained to predict the probabilities of each label, including the blank label, for each position in the input sequence.

This is one of the best articles I’ve found that explains sequence modeling with CTC: https://distill.pub/2017/ctc/ 

CTC Loss —

During training, the CTC loss function compares the predicted output sequence with the ground-truth sequence while taking all possible alignments into account. This involves summing the probabilities of all paths that collapse to the ground-truth sequence, including paths that contain the blank label. The goal is to maximize this total probability (equivalently, to minimize its negative log-likelihood), which implicitly pushes probability mass away from paths that do not produce the target.
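In symbols: if $\mathcal{B}$ is the collapse function that first merges repeated labels and then removes blanks (writing the blank as "-", both "-cc-a-t" and "cc-aa-t" collapse to "cat"), then for an input $\mathbf{x}$ with $T$ timesteps and a target $\mathbf{y}$:

$$
P(\mathbf{y} \mid \mathbf{x}) \;=\; \sum_{\pi \,\in\, \mathcal{B}^{-1}(\mathbf{y})} \;\prod_{t=1}^{T} p(\pi_t \mid \mathbf{x}),
\qquad
\mathcal{L}_{\text{CTC}} \;=\; -\log P(\mathbf{y} \mid \mathbf{x})
$$

In practice this sum is computed with a forward-backward dynamic-programming recursion rather than by enumerating paths.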

CTC Decoding: Greedy Search and Beam Search —

During decoding, the CTC algorithm is used to convert the output probabilities into a sequence of labels. This involves finding the most likely label sequence that is consistent with the output probabilities, taking into account the blank label and allowing for some level of uncertainty or variability in the alignment between the input and output sequences.

Greedy search is a simple decoding method that involves selecting the label with the highest probability at each time step and appending it to the output sequence. It is fast and computationally efficient, but it may not always find the optimal output sequence.
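A minimal sketch of greedy (best-path) CTC decoding, assuming probs is a (T, num_classes) array of per-timestep label probabilities and that index 0 is the blank (both are assumptions of this illustration, not requirements of CTC):

import numpy as np

def ctc_greedy_decode(probs, blank=0):
    # pick the most likely label at each timestep, then collapse repeats and drop blanks
    best_path = np.argmax(probs, axis=1)
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank:
            decoded.append(int(label))
        prev = label
    return decoded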

Beam search, on the other hand, is a more complex decoding method that involves considering multiple candidate output sequences at each time step, rather than just selecting the one with the highest probability. It is more computationally intensive than greedy search, but it can be more effective at finding the optimal output sequence.
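And a simplified beam-search sketch over the same (T, num_classes) probabilities. This version keeps the top candidate paths by raw path probability and omits the CTC-specific step of merging paths that collapse to the same label sequence, which a full CTC prefix beam search would include:

def simple_beam_search(probs, beam_width=3):
    # keep the beam_width most probable label paths at each timestep
    beams = [((), 1.0)]                                   # (path, path probability)
    for step_probs in probs:                              # iterate over timesteps
        candidates = []
        for path, p in beams:
            for label, p_label in enumerate(step_probs):
                candidates.append((path + (label,), p * p_label))
        # prune: keep only the beam_width highest-probability candidates
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams

With beam_width = 1 this reduces to the greedy best path, and each surviving path would still be collapsed (repeats merged, blanks removed) as in the greedy decoder.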

This article explains beam and greedy search in more detail: Foundations of NLP Explained Visually: Beam Search, How It Works

Example: RNN, GRU, LSTM for Sentiment Analysis

Here, I will go over a demonstration of RNNs, GRUs, and LSTMs on a sentiment analysis task.

Reference: Implementation of SimpleRNN, GRU, and LSTM Models in Keras and Tensorflow For an NLP Project

Task—

Predict the sentiment of movie reviews from the TensorFlow IMDB dataset (25k train & 25k test) and report the accuracy and loss at each epoch. Each row of this dataset contains the text of a review, and the label is either 0 or 1, representing negative or positive sentiment. Learn more about this dataset here.

Steps —

  • load the dataset: imdb_reviews (a loading sketch with tensorflow_datasets is shown right after this list)

  • get the train and test set (sentences and labels)

  • define the model parameters: vocab size, embedding dimension, maximum length, truncation type, and unknown word representation

  • tokenize the texts for the neural network

  • build the RNN, GRU, and LSTM models

  • define the loss (binary cross entropy), optimizer (adam), and metrics (accuracy)

  • plot the accuracy and loss after each epoch for both the train and validation set
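Here is a sketch of how the first two steps could look with tensorflow_datasets (the dataset name imdb_reviews and the train/test splits are real; the variable names are chosen to match the code that follows):

import numpy as np
import tensorflow_datasets as tfds

# load the imdb_reviews dataset; as_supervised=True yields (text, label) pairs
imdb, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)
train_data, test_data = imdb['train'], imdb['test']

training_sentences, training_labels = [], []
for text, label in tfds.as_numpy(train_data):
    training_sentences.append(text.decode('utf8'))   # bytes -> str
    training_labels.append(label)

testing_sentences, testing_labels = [], []
for text, label in tfds.as_numpy(test_data):
    testing_sentences.append(text.decode('utf8'))
    testing_labels.append(label)

# Keras expects label arrays, not Python lists
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)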


from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# MODEL PARAMETERS
# only the 10,000 most frequent words in the reviews are kept; rarer words are not used to train the model
vocab_size = 10000

# size of the vector used to represent each word
embedding_dim = 16

# each review is padded or truncated to a maximum length of 120 tokens
max_length = 120

# reviews longer than max_length are truncated at the end
trunc_type = 'post'

# unknown words (words outside the vocabulary) are represented by the oov (out-of-vocabulary) token
oov_tok = "<OOV>"

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

# train
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)

# test
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length)

The tokenizer.texts_to_sequences() method takes a list of strings as input and returns a list of sequences, where each sequence is a list of integers corresponding to the tokens (i.e., words or subwords) in the original string.

Each unique token in the input text is assigned a unique integer index by the tokenizer object, and the corresponding sequence is created by replacing each token in the original text with its integer index.

The resulting sequences of integers can be used as input to a neural network for training or prediction, since neural networks generally require input data to be in the form of numerical arrays rather than text.
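A toy illustration of this mapping (the sentences here are made up for demonstration and are not from the IMDB data):

from tensorflow.keras.preprocessing.text import Tokenizer

toy_tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
toy_tokenizer.fit_on_texts(["the movie was great", "the movie was terrible"])

print(toy_tokenizer.word_index)
# {'<OOV>': 1, 'the': 2, 'movie': 3, 'was': 4, 'great': 5, 'terrible': 6}

print(toy_tokenizer.texts_to_sequences(["the movie was awful"]))
# [[2, 3, 4, 1]]  <- 'awful' was never seen during fitting, so it maps to the <OOV> index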


# RNN
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# GRU
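# note: this GRU model and the LSTM model below are wrapped in a Bidirectional layer, unlike the SimpleRNN model above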
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                             input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# LSTM
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
  
# use the code below to compile and train each of the three models above
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
num_epochs = 30
history = model.fit(padded, training_labels_final, 
                  epochs=num_epochs, validation_data = (testing_padded, testing_labels_final))

Insights

After 30 epochs, here are the metrics:

  • RNN: loss: 0.0148 - accuracy: 0.9955 - val_loss: 1.5834 - val_accuracy: 0.7134

  • GRU: loss: 0.0046 - accuracy: 0.9984 - val_loss: 1.5316 - val_accuracy: 0.7996

  • LSTM: loss: 0.0025 - accuracy: 0.9994 - val_loss: 1.6503 - val_accuracy: 0.8082

The LSTM has the best training and validation accuracy overall, followed by the GRU, with the RNN performing worst. Across all three models, the training accuracy is high (close to 100%) while the validation accuracy is much lower (70-80%). This means the models are probably overfitting to the training data.

Overfitting occurs when a model is too complex and captures the noise and random variations in the training data, rather than the underlying patterns that generalize to new, unseen data. This leads to a model that performs well on the training data but generalizes poorly to new data.

Generally, to address overfitting, you can try several techniques, such as reducing the model's complexity (e.g., by decreasing the number of trainable parameters), increasing the amount of regularization (e.g., using L1 or L2 penalties), adding more training data to improve the model's ability to generalize, or using dropout or early stopping during training.
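For illustration, here is a hedged sketch of how a few of these options could be added to the bidirectional LSTM model above (the dropout rates and the early-stopping patience are arbitrary values chosen for demonstration, not tuned settings):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2)),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dropout(0.5),                      # drop half the units during training
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# stop training when the validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(padded, training_labels_final,
                    epochs=num_epochs,
                    validation_data=(testing_padded, testing_labels_final),
                    callbacks=[early_stop])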
