Attention-based Speech-to-Text Deep Neural Network
CMU 11-785 Course: Deep Learning
Task —
This project is a sequence-to-sequence conversion task to transcribe speech recordings into word sequences spelled out alphabetically. The goal is to build an encoder to effectively extract features from a speech signal, construct a decoder to sequentially spell out the transcription of the audio, and implement an attention mechanism between the decoder and the encoder.
Evaluation —
Levenshtein Distance (LD): the number of character edits (insertions, deletions, and substitutions) needed to transform the predicted sequence into the target sequence. The lower the LD, the better!
Challenge —
For speech recognition problems, there may be no obvious correspondence between the input and output sequences. For example, given a speech recording of the phrase “know how,” there is no audio corresponding to the silent characters “k” and “w” in “know” or to the blank space between “know” and “how.” So, we must use one network to process the input and another network to compute the output, i.e., an encoder-decoder architecture.
Key Concepts —
Sequence-to-sequence conversion
Takes a variable-length sequence of data as input and produces a variable-length sequence of data as output. It is used in a variety of applications such as machine translation, speech recognition, text summarization, and more. The model consists of an encoder network that converts the input sequence into a fixed-length representation, and a decoder network that generates the output sequence based on the encoded input. The model is trained end-to-end using backpropagation through time, allowing it to learn to map the input sequence to the output sequence.
LAS model (Listen, Attend and Spell)
This is the baseline architecture used in this project. The listener is responsible for transforming the input speech signal into a sequence of acoustic features. The attention mechanism selectively focuses on specific portions of the input sequence at each time step of the decoding process. The speller takes the attended features and generates the output sequence, which corresponds to the recognized text. The LAS model has been shown to be effective at handling long sequences of speech data and has achieved state-of-the-art results on several benchmarks for automatic speech recognition and speech-to-text tasks.
Encoder
This is the listener. It consists of a pyramidal Bi-LSTM (pBLSTM) network that processes the input data and generates a high-level vector representation, or encoding, of the audio.
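As a rough illustration, here is a minimal PyTorch sketch of one pBLSTM layer, where adjacent frames are concatenated so the time resolution is halved before the bidirectional LSTM (the names and dimensions are my own, not the exact project code):

```python
import torch
import torch.nn as nn

class pBLSTM(nn.Module):
    """One pyramidal Bi-LSTM layer: stacks pairs of adjacent frames so the
    time resolution is halved before the bidirectional LSTM runs."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # The feature dimension doubles because two consecutive frames are stacked.
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim,
                             bidirectional=True, batch_first=True)

    def forward(self, x):                           # x: (batch, time, feat)
        b, t, f = x.shape
        x = x[:, : t - (t % 2), :]                  # drop an odd trailing frame
        x = x.reshape(b, t // 2, f * 2)             # stack pairs of frames
        out, _ = self.blstm(x)                      # (batch, time // 2, 2 * hidden_dim)
        return out
```

Stacking three such layers reduces the audio length by a factor of 8, which shortens the sequence the attention mechanism has to scan.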
Attention
The mechanism by which the decoder derives information from the encoder output: a context vector, computed as a weighted sum of the sequence of representation vectors produced by the encoder, is used to condition the language model. The attention weights pay “attention” to the parts of the input most relevant to each output.
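A minimal sketch of the kind of dot-product attention described here, assuming a single query per decoding step (the real model may add learned projections for queries, keys, and values):

```python
import torch
import torch.nn.functional as F

def dot_product_attention(query, keys, values, mask=None):
    """query: (batch, d); keys/values: (batch, time, d); mask: (batch, time) bool.
    Returns the context vector and the attention weights."""
    energy = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)        # (batch, time)
    if mask is not None:
        energy = energy.masked_fill(~mask, float('-inf'))          # ignore padded frames
    weights = F.softmax(energy / keys.size(-1) ** 0.5, dim=1)      # (batch, time)
    context = torch.bmm(weights.unsqueeze(1), values).squeeze(1)   # (batch, d)
    return context, weights
```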
Decoder
This is a conditional language model that depends on the features extracted from the encoder and generates the actual output sequence. It uses an attention mechanism to focus on different parts of the encoding at each step of the generation process.
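As a hedged sketch of how the speller, attention, and output projection fit together in one decoding step, assuming an LSTMCell-based decoder and an attention function like the one above (the class and argument names are illustrative, not the project's actual interface):

```python
import torch
import torch.nn as nn

class SpellerStep(nn.Module):
    """One decoding step: embed the previous character, update the LSTM state,
    attend over the encoder outputs, and predict the next character."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, enc_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim + enc_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim + enc_dim, vocab_size)

    def forward(self, prev_char, prev_context, state, encoder_out, attend):
        # prev_char: (batch,) token ids; prev_context: (batch, enc_dim)
        emb = self.embed(prev_char)
        h, c = self.cell(torch.cat([emb, prev_context], dim=1), state)
        # `attend` can be any attention function; for plain dot-product attention
        # the decoder hidden size must match (or be projected to) enc_dim.
        context, weights = attend(h, encoder_out, encoder_out)
        logits = self.proj(torch.cat([h, context], dim=1))
        return logits, context, (h, c), weights
```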
Teacher-forcing scheduler
A technique used during training of sequence-to-sequence models. It involves feeding the ground truth (true target sequence) as input to the decoder during training instead of the previously predicted output. By gradually decreasing the amount of teacher-forcing, the model learns to rely on its own previous predictions and becomes less dependent on the ground truth inputs.
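For illustration, a simple linear annealing schedule for the teacher-forcing rate (the start/end rates and decay length here are assumptions, not the values I used):

```python
import torch

def teacher_forcing_rate(epoch, start=1.0, end=0.6, decay_epochs=30):
    """Linearly anneal the probability of feeding the ground-truth character."""
    return start - (start - end) * min(epoch / decay_epochs, 1.0)

# Inside the decoder loop during training (sketch):
# use_ground_truth = torch.rand(1).item() < teacher_forcing_rate(epoch)
# next_input = target_char if use_ground_truth else predicted_char
```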
Pad-packing
A technique used to efficiently process variable-length input sequences in batched training. In speech-to-text models, audio samples are usually of different lengths, so they are padded with zeros to a common length and then packed together with their true lengths so that the RNN can skip the padded frames.
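A minimal example using PyTorch's padding/packing utilities (the feature dimension and sequence lengths are made up for the sketch):

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three MFCC sequences of different lengths, shaped (time, feat).
batch = [torch.randn(120, 15), torch.randn(95, 15), torch.randn(140, 15)]
lengths = torch.tensor([x.size(0) for x in batch])

padded = pad_sequence(batch, batch_first=True)            # (3, 140, 15), zero-padded
packed = pack_padded_sequence(padded, lengths,
                              batch_first=True, enforce_sorted=False)

lstm = torch.nn.LSTM(15, 64, batch_first=True, bidirectional=True)
out_packed, _ = lstm(packed)                              # padded frames are skipped
out, out_lengths = pad_packed_sequence(out_packed, batch_first=True)
```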
Levenshtein Distance
Also known as edit distance, this metric measures the difference between two character sequences in speech-to-text models. It determines the minimum number of character insertions, deletions, or substitutions required to transform one sequence into the other, taking into account the order and position of characters. The model generates a transcript by mapping an input audio signal to a character sequence, which is then compared to the target sequence using Levenshtein distance to evaluate accuracy.
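For reference, a small dynamic-programming implementation of the metric:

```python
def levenshtein(pred: str, target: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn pred into target."""
    prev = list(range(len(target) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, t in enumerate(target, start=1):
            cost = 0 if p == t else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("know how", "no how"))  # 2
```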
Dataset —
This project was on the LibriSpeech dataset — a corpus of approximately 1000 hours of read English speech created by collecting recordings of audiobooks from the LibriVox project, which are then segmented and aligned with their corresponding text transcripts. It contains a range of reading styles and accents, and is often used as a benchmark for evaluating automatic speech recognition systems.
Train: data pairs of sequences of audio feature vectors (MFCCs) and their transcriptions. The audio clips are of different lengths
Validation: MFCCs and their transcriptions
Test: only the audio feature vectors
Inference —
The process of finding the most likely transcription for the given audio. The speller needs to draw a sample from the language model. Some ways of approaching the “find the most likely output sequence” task are:
Greedy search: the simplest decoding method for inference. At each time step, you draw the character with the highest probability from the distribution produced by your model (see the sketch after this list)
Random search: draw outputs randomly from the distribution. This could sometimes yield a more probable sequence than the greedy output
Beam search: this was the method used in the LAS paper. However, we would not need to explicitly evaluate all possible sequences. This search method expands the K most probable paths until it finds the most likely path that ends in <eos>
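A minimal sketch of greedy decoding, assuming the speller produces per-step log-probabilities and an idx2char mapping exists (both names are illustrative):

```python
import torch

def greedy_decode(log_probs, eos_idx, idx2char):
    """log_probs: (max_len, vocab) per-step distributions from the speller.
    Picks the most likely character at each step and stops at <eos>."""
    chars = []
    for idx in log_probs.argmax(dim=-1).tolist():
        if idx == eos_idx:
            break
        chars.append(idx2char[idx])
    return "".join(chars)
```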
Model and Hyperparameters —
Epochs: 10-100
Batch size: 128
Initial learning rate: 0.0001
Optimizer: Adam/AdamW
Scheduler: ReduceLROnPlateau
Mixed precision: used when training on a GPU that supports it (Tesla T4, V100, etc.)
Maximum output length: 550 when generating transcripts for validation and test
Model: Encoder-Attention-Decoder (39.33M total parameters)
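For illustration, a compact sketch of how the optimizer, scheduler, and mixed precision fit together (the dummy model, loss, and the scheduler's factor/patience values are assumptions, and a CUDA GPU is required to run it):

```python
import torch
import torch.nn as nn

# Stand-in for the actual Encoder-Attention-Decoder model.
model = nn.Linear(15, 30).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                       factor=0.5, patience=2)
scaler = torch.cuda.amp.GradScaler()            # mixed precision (T4/V100-class GPUs)

for epoch in range(3):                          # 10-100 epochs in practice
    x = torch.randn(128, 15).cuda()             # batch size 128
    y = torch.randint(0, 30, (128,)).cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():             # forward pass in mixed precision
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()               # scaled backward to avoid underflow
    scaler.step(optimizer)
    scaler.update()
    scheduler.step(loss.item())                 # in practice, step on validation LD
```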
Submission Cutoffs —
high (7 LD), medium (14 LD), low (21 LD), very low (45 LD)
My final submission achieved a Levenshtein distance (LD) of 8.9208, which was very close to the high cutoff 😓
Reflection —
This was the final project of the semester, and it proved to be a challenging endeavor. It required us to implement numerous classes, functions, and networks from scratch, which was time-consuming. Moreover, training the model on a large dataset was extremely demanding, and it was fortunate that we had access to a toy dataset. Despite this, training the model still required a significant amount of time, with each epoch taking between 6 and 12 minutes; completing 100 epochs took approximately 10 hours, excluding several ablations. Additionally, I learned that it was crucial to properly tune some hyperparameters and add more regularization to achieve optimal performance.