Building My Own BERT

CMU 11-711 Course: Advanced NLP

Objective —

Develop a minimalist version of BERT (Bidirectional Encoder Representations from Transformers), implementing its core components (self-attention, the encoder layer, the full model, the optimizer, and a sentence classifier) to perform sentence classification on the SST and CFIMDB datasets.

Project website: https://github.com/cmu-anlp/minbert-assignment

Description —

BERT, developed by Google AI in 2018, is a machine learning model for Natural Language Processing (NLP). It:

  • Pre-trains on a massive dataset of unlabeled text, like Wikipedia. This allows it to learn the fundamental patterns and relationships within language.

  • Uses a unique "bidirectional" approach to analyze text. Unlike traditional models, BERT considers the context of both preceding and following words, giving it a deeper understanding of meaning.

(Figure from the original BERT paper.) The original BERT model was pre-trained on two unsupervised tasks over Wikipedia articles: masked token prediction and next sentence prediction.
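
To make the masked-token objective concrete, here is a toy Python sketch of how input tokens might be hidden so the model must predict the originals; the masking rate, helper name, and example sentence are illustrative assumptions, not the assignment's or the paper's code.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets): targets[i] holds the original token
    at each masked position and None elsewhere (positions ignored by the loss)."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)   # hide the token from the model
            targets.append(tok)         # the model must predict this original token
        else:
            masked.append(tok)
            targets.append(None)        # not used in the masked-LM loss
    return masked, targets

print(mask_tokens("the movie was surprisingly good".split()))
```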

Why is BERT important?

  • BERT's context-aware approach leads to more accurate results in NLP tasks.

  • Its multi-tasking ability makes it a valuable tool for a wide range of applications, from search engines and chatbots to automated content analysis and machine translation.

  • Pre-trained models like BERT save time and resources compared to training models from scratch for each task.

  • The core BERT model is open-source, allowing developers and researchers to build upon it and create new applications.

Datasets Used in this Project: SST & CFIMDB —


  1. Stanford Sentiment Treebank (SST): Consists of 11,855 single sentences extracted from movie reviews. The dataset was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges. Each phrase is labeled as negative, somewhat negative, neutral, somewhat positive, or positive. In this project, I use BERT embeddings to predict these sentiment labels. The SST dataset has the following splits: train (8,544 examples), dev (1,101 examples), test (2,210 examples).

  2. Cornell Movie Review Dataset (CFIMDB): Consists of 2,434 highly polar movie reviews, each with a binary label of negative or positive. Many of the reviews are longer than one sentence. In this project, I use BERT embeddings to predict these sentiment labels. The CFIMDB dataset has the following splits: train (1,701 examples), dev (245 examples), test (488 examples).

Parts of BERT I Implemented —

Self-attention:

  • This layer focuses on relationships between words in a sentence.

  • It compares each token's "query" against the "keys" of all tokens to produce attention weights, then combines the corresponding "values" as a weighted sum.

  • I implemented this by projecting the hidden states into queries, keys, and values, splitting them across multiple attention heads, and computing the scaled, softmax-weighted sums (see the sketch after this list).
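
Below is a minimal PyTorch sketch of this multi-head self-attention step, assuming BERT-base sizes (hidden size 768, 12 heads); the class and argument names are my own simplification, not the assignment's exact BertSelfAttention code.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Simplified multi-head self-attention: project to queries/keys/values,
    split into heads, take softmax-weighted sums of the values, merge heads."""
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def split_heads(self, x):
        # [batch, seq, hidden] -> [batch, heads, seq, head_dim]
        b, s, _ = x.shape
        return x.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, hidden_states, attention_mask=None):
        q = self.split_heads(self.query(hidden_states))
        k = self.split_heads(self.key(hidden_states))
        v = self.split_heads(self.value(hidden_states))
        # Scaled dot-product attention scores between every pair of positions.
        scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(self.head_dim)
        if attention_mask is not None:
            scores = scores + attention_mask  # additive mask: ~0 for real tokens, large negative for padding
        probs = torch.softmax(scores, dim=-1)
        context = torch.matmul(probs, v)       # weighted sum of the values
        # Merge heads back: [batch, heads, seq, head_dim] -> [batch, seq, hidden]
        b, h, s, d = context.shape
        return context.transpose(1, 2).contiguous().view(b, s, h * d)
```

The additive attention mask (roughly zero for real tokens, a large negative number for padding) keeps padded positions from receiving attention weight.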

Layer:

  • This layer builds on self-attention with additional processing to form one encoder block.

  • It passes the input through self-attention, adds the result back with a residual connection and layer normalization, then applies a feed-forward network followed by another residual connection and layer normalization (sketched below).
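
A sketch of one encoder layer, reusing the MultiHeadSelfAttention class from the previous snippet; the sub-layer names, dropout rate, and 4x intermediate size follow common BERT-base conventions and are assumptions rather than the assignment's exact structure.

```python
import torch.nn as nn

class BertLayerSketch(nn.Module):
    """One encoder block: self-attention -> add & layer-norm -> feed-forward -> add & layer-norm."""
    def __init__(self, hidden_size=768, intermediate_size=3072, num_heads=12, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadSelfAttention(hidden_size, num_heads)  # from the sketch above
        self.attn_dense = nn.Linear(hidden_size, hidden_size)
        self.attn_norm = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Linear(intermediate_size, hidden_size),
        )
        self.ffn_norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, hidden_states, attention_mask=None):
        # Sub-layer 1: self-attention with a residual connection and layer normalization.
        attn_out = self.attn_dense(self.attention(hidden_states, attention_mask))
        hidden_states = self.attn_norm(hidden_states + self.dropout(attn_out))
        # Sub-layer 2: position-wise feed-forward network, again with residual + normalization.
        ffn_out = self.ffn(hidden_states)
        return self.ffn_norm(hidden_states + self.dropout(ffn_out))
```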

Model:

  • This combines layers to create the overall model.

  • It starts with word and positional embeddings, then stacks several BERT layers.

  • Finally, it returns the last hidden state (contextualized embeddings for every token) and the [CLS] token embedding used for sentence-level prediction (see the sketch below).
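
Putting the pieces together, here is a sketch of the full encoder that stacks the layer above; token-type embeddings, dropout, and weight initialization are omitted for brevity, and the sizes mirror BERT-base as an assumption.

```python
import torch
import torch.nn as nn

class MiniBertSketch(nn.Module):
    """Word + position embeddings, a stack of encoder layers, and a pooled [CLS] embedding."""
    def __init__(self, vocab_size=30522, hidden_size=768, num_layers=12,
                 num_heads=12, max_position=512):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(max_position, hidden_size)
        self.embed_norm = nn.LayerNorm(hidden_size)
        self.layers = nn.ModuleList(
            [BertLayerSketch(hidden_size, 4 * hidden_size, num_heads) for _ in range(num_layers)]
        )
        self.pooler = nn.Linear(hidden_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        seq_len = input_ids.size(1)
        positions = torch.arange(seq_len, device=input_ids.device).unsqueeze(0)
        x = self.embed_norm(self.word_embeddings(input_ids) + self.position_embeddings(positions))
        for layer in self.layers:
            x = layer(x, attention_mask)
        last_hidden_state = x                             # contextualized embedding for every token
        cls_embedding = torch.tanh(self.pooler(x[:, 0]))  # pooled embedding of the first ([CLS]) token
        return last_hidden_state, cls_embedding
```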

Sentence classifier:

  • It uses the BERT model to get the [CLS] token embedding, then applies dropout and a linear layer to predict the sentence class.

  • It freezes the BERT parameters when only the classifier head is trained on top of the pre-trained weights, and updates all parameters when fine-tuning (see the sketch after this list).
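
A sketch of the classifier head on top of the encoder sketch above; the option flag mimics the switch between training only the head on frozen pre-trained weights and fully fine-tuning, but the names and the dropout rate are assumptions.

```python
import torch.nn as nn

class BertSentenceClassifierSketch(nn.Module):
    """[CLS] embedding -> dropout -> linear layer over the sentiment classes."""
    def __init__(self, bert, hidden_size=768, num_labels=5, dropout=0.3, option="finetune"):
        super().__init__()
        self.bert = bert  # e.g. a MiniBertSketch instance
        # Freeze BERT when only the classifier head is trained; update everything when fine-tuning.
        for param in self.bert.parameters():
            param.requires_grad = (option == "finetune")
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        _, cls_embedding = self.bert(input_ids, attention_mask)
        return self.classifier(self.dropout(cls_embedding))
```

For SST, num_labels would be 5 (the five sentiment classes); for CFIMDB, it would be 2.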

Insights and Results —

Building a mini BERT was a valuable learning experience that gave me a deeper understanding of transformers, self-attention, and other NLP fundamentals. While it may seem obvious, even a mini BERT needs sufficient training data: the more examples the model sees, the better it captures the complex relationships and nuances of language, which ultimately boosts accuracy. Both pre-training and fine-tuning rely heavily on quality data; pre-training requires massive amounts of general text to learn language patterns, while fine-tuning needs labeled data specific to the task (e.g., sentiment labels for sentence classification).

The multi-head attention mechanism plays a crucial role in both stages: it allows BERT to model relationships between words and capture contextual meaning. Also, unlike traditional left-to-right models, BERT's bidirectional approach, which analyzes context in both directions, converged slowly and made training time-consuming.

The numbers in parentheses are the accuracy values I obtained, compared against the target reference accuracies listed before them.

Finetuning for SST:

  • Dev Accuracy: 0.515 (0.51316)

  • Test Accuracy: 0.526 (0.53348)

Finetuning for CFIMDB:

  • Dev Accuracy: 0.966 (0.97142)  

  • Test Accuracy: test labels were withheld (0.51229)
