Frame Level Classification of Speech

CMU 11-785: Introduction to Deep Learning

Task —

Predict the phoneme state label for each frame of the speech recordings in the test set, where each recording is given as a sequence of raw mel spectrogram frames.

Evaluation —

Frame-level accuracy of the predicted phoneme state labels on the test set.

Key Concepts —

Phonemes

  • The individual sounds that make up words and distinguish one word from another. For example, the words "cat" and "bat" differ only in their initial phonemes, "K" and "B".

Mel Spectrogram

  • A representation of the frequency content of a signal over time. It is commonly used in speech recognition systems because it is far more compact than the raw audio. A mel spectrogram is generated by taking the short-time Fourier transform of the audio, computing the power spectrum of each frame, and then mapping the result onto the mel scale with a filter bank, so each frame becomes a small vector of mel filter energies.
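
The project provides the mel spectrogram frames precomputed, but as an illustration of the concept, here is a hypothetical snippet that converts raw audio into log-mel frames with torchaudio; the parameter values (sample rate, FFT size, hop length, number of mel bands) are arbitrary placeholders, not the settings used to build the dataset.

import torch
import torchaudio

# Stand-in for one second of 16 kHz audio (shape: channels x samples).
waveform = torch.randn(1, 16000)

# STFT -> power spectrum -> mel filter bank, then convert to decibels.
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=27)
log_mel = torchaudio.transforms.AmplitudeToDB()(to_mel(waveform))

print(log_mel.shape)  # (1, 27, num_frames): one 27-dim mel vector per frame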

Dataset —

  • Train: audio recordings (utterances in raw mel spectrogram frames) and their frame-level phoneme state (subphoneme) labels. This data comes from articles published in the Wall Street Journal that are read aloud and labeled using the original text. Note: the utterances are of variable length.

  • Test: audio recordings without the phoneme state labels

Project Workflow —

The 42 Phonemes For This Project —


PHONEMES = [
    '[SIL]', 'AA',    'AE',    'AH',    'AO',    'AW',    'AY',
    'B',     'CH',    'D',     'DH',    'EH',    'ER',    'EY',
    'F',     'G',     'HH',    'IH',    'IY',    'JH',    'K',
    'L',     'M',     'N',     'NG',    'OW',    'OY',    'P',
    'R',     'S',     'SH',    'T',     'TH',    'UH',    'UW',
    'V',     'W',     'Y',     'Z',     'ZH',    '[SOS]', '[EOS]',
]

Dataset Class —
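
A minimal sketch of such a dataset class is shown below, assuming a map-style PyTorch Dataset that concatenates all utterances, zero-pads both ends with context frames, and returns a flattened window of 2 × context + 1 frames together with the label of the center frame. The names (AudioDataset, mfccs, transcripts) and the per-utterance cepstral mean-variance normalization are illustrative, not necessarily the exact code I used.

import numpy as np
import torch
from torch.utils.data import Dataset

class AudioDataset(Dataset):
    """Frame-level dataset: each item is a (2*context+1)-frame window and its label.

    Assumes `mfccs` is a list of (T_i, num_feats) float arrays and
    `transcripts` is a list of length-T_i integer label arrays.
    """

    def __init__(self, mfccs, transcripts, context=25):
        self.context = context

        # Illustrative per-utterance cepstral mean-variance normalization.
        mfccs = [(m - m.mean(axis=0)) / (m.std(axis=0) + 1e-8) for m in mfccs]

        # Concatenate all utterances and zero-pad both ends so every frame
        # has a full window of context around it.
        self.labels = np.concatenate(transcripts, axis=0)
        feats = np.concatenate(mfccs, axis=0)
        self.mfccs = np.pad(feats, ((context, context), (0, 0)), mode='constant')

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Window of 2*context + 1 frames centered on original frame `idx`.
        window = self.mfccs[idx : idx + 2 * self.context + 1]
        frames = torch.as_tensor(window, dtype=torch.float32).flatten()
        label = torch.tensor(self.labels[idx], dtype=torch.long)
        return frames, label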

Model and Hyperparameters —

  • Epochs: 5-20

  • Batch size: 1024-4096

  • Context: 20-30 frames on each side of the target frame

  • Initial learning rate: 0.001

  • Normalization: Cepstral Normalization

  • Criterion: Cross-Entropy Loss because this is a multi-class classification task

  • Optimizer: Adam or AdamW

    • Adam and AdamW are adaptive learning rate optimization algorithms for training deep neural networks. The main difference is how they handle weight decay. In Adam, weight decay is implemented as L2 regularization added to the gradient, so it is rescaled by the adaptive per-parameter learning rates, which can be suboptimal. AdamW decouples weight decay from the gradient update and applies it directly to the weights, which typically gives better generalization and more reliable convergence than Adam.

  • Scheduler: cosine annealing (e.g., CosineAnnealingLR) or ReduceLROnPlateau (the latter performed better for me with a high factor and a low patience)

    • A cosine scheduler reduces the learning rate gradually along a cosine curve, starting with a high learning rate and decaying smoothly toward a minimum as the model converges.

    • ReduceLROnPlateau monitors the validation loss of the model and reduces the learning rate when the loss plateaus.

  • Mixed Precision: if you are training on a GPU that supports it (Tesla T4, V100, etc.)

    • A technique that speeds up training of deep neural networks on GPUs by using lower-precision (e.g., half-precision) floating point numbers for the forward and backward computations while keeping higher-precision (e.g., single-precision) copies of the weights and accumulators. This reduces memory use and increases throughput, leading to faster training times. It can be sensitive to numerical stability issues, which is why PyTorch's torch.cuda.amp pairs autocasting with gradient (loss) scaling to keep small half-precision gradients from underflowing; see the training-loop sketch further below.

  • Model: repeated Linear → BatchNorm1d → ReLU/GELU → Dropout blocks (varied number of layers); a sketch follows this list

    • My final model had 19.12M parameters in total.
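
Below is a minimal sketch of this block structure, assuming the flattened context window as input; the hidden widths, dropout rate, and depth are illustrative and do not reproduce the exact 19.12M-parameter configuration.

import torch.nn as nn

class FrameMLP(nn.Module):
    """MLP of Linear -> BatchNorm1d -> GELU -> Dropout blocks for frame classification."""

    def __init__(self, input_size, num_classes=42,
                 hidden_sizes=(2048, 2048, 1024, 1024, 512), dropout=0.25):
        super().__init__()
        layers = []
        in_features = input_size  # (2 * context + 1) * num_feats
        for width in hidden_sizes:
            layers += [
                nn.Linear(in_features, width),
                nn.BatchNorm1d(width),
                nn.GELU(),
                nn.Dropout(dropout),
            ]
            in_features = width
        layers.append(nn.Linear(in_features, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Returns raw logits; CrossEntropyLoss applies log-softmax internally.
        return self.net(x)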

For my training, I used Google Colab Pro+, which provided a Tesla T4 GPU. I also logged my model runs to Weights & Biases (wandb) and saved my best-performing model with PyTorch checkpoints. A sketch of a training loop tying these pieces together follows.
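
Here is a hedged sketch of how the pieces above (AdamW, ReduceLROnPlateau, mixed precision, checkpointing) might fit together; train_loader, val_loader, the context size, and the feature count are placeholder assumptions, and wandb logging is omitted for brevity.

import torch
from torch import nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
context, num_feats, num_epochs = 25, 27, 10  # illustrative values

model = FrameMLP(input_size=(2 * context + 1) * num_feats).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=1)
scaler = torch.cuda.amp.GradScaler(enabled=(device == 'cuda'))

best_val_acc = 0.0
for epoch in range(num_epochs):
    model.train()
    for frames, labels in train_loader:
        frames, labels = frames.to(device), labels.to(device)
        optimizer.zero_grad()
        # Forward pass and loss in mixed precision; gradients are scaled to
        # avoid underflow, then unscaled before the optimizer step.
        with torch.cuda.amp.autocast(enabled=(device == 'cuda')):
            loss = criterion(model(frames), labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

    # Validation accuracy drives the plateau scheduler and checkpointing.
    model.eval()
    correct, total, val_loss = 0, 0, 0.0
    with torch.no_grad():
        for frames, labels in val_loader:
            frames, labels = frames.to(device), labels.to(device)
            logits = model(frames)
            val_loss += criterion(logits, labels).item() * labels.size(0)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    val_acc = correct / total
    scheduler.step(val_loss / total)

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save({'model_state_dict': model.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict(),
                    'epoch': epoch},
                   'checkpoint.pth')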

Submission Cutoffs —

  • High (89%)

  • Very low (65%)

My final submission achieved 88.01% accuracy on the frame-level phoneme state labels in the test set, just short of the high cutoff 😓

Reflection —

This was my first time working on a deep learning project like this, so I faced a steep learning curve. To keep track of how each hyperparameter affected training accuracy, I maintained a Google Sheet recording my ablations. Initially, I struggled to understand how adjusting one hyperparameter could influence overall model performance, but as I delved deeper into the project I developed a much better intuition for it. I also learned the importance of training on a small amount of data before committing to the full dataset, since it allows for efficient debugging: starting with a toy dataset (about 1 min per epoch), then a subset of the training data (about 8 mins per epoch), and finally the full dataset (about 25 mins per epoch) kept iteration fast and avoided prolonged training runs. Training on the full dataset for 10 epochs took about 4 hours…
