Multi-Layer Perceptron (MyTorch ep.1)

CMU 11-785 Course: Deep Learning

In the MyTorch series (ep.1-3), I implemented my own deep learning library, a spinoff of PyTorch, from scratch: MLP, CNN, GRU, LSTM. In this assignment (ep.1), I implemented multilayer perceptrons (MLPs) from scratch with 0, 1, and 4 hidden layers.

Due to academic integrity policies, I will only go over the broader concepts covered in this assignment, not the actual code. To demonstrate an MLP example, I will walk through a simple sentiment analysis task using Sklearn’s Perceptron and MLPClassifier.

Perceptron —

A single-neuron model that was a precursor to larger neural networks. It can only learn linear relationships between the input and output data provided. The perceptron comes from a field of research that investigates how simple models of biological neurons can be used to solve difficult computational tasks, like the predictive modeling tasks we see in machine learning.

Weights are often initialized to small random values, such as values in the range 0 to 0.3, although more complex initialization schemes can be used. Larger weight magnitudes indicate a more complex, and often more fragile, model.
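As a rough illustration of the idea (my own sketch, not the MyTorch assignment code), a perceptron forward pass is just a weighted sum followed by a step function; the function names below are illustrative:

import numpy as np

def perceptron_predict(x, weights, bias):
    '''Single-neuron forward pass: weighted sum of inputs, then a step function.'''
    z = np.dot(weights, x) + bias    # linear combination of the inputs
    return 1 if z > 0 else 0         # step activation: fire or don't fire

# Illustrative usage with small random weights (the 0-0.3 range mentioned above)
rng = np.random.default_rng(0)
weights = rng.uniform(0.0, 0.3, size=3)
print(perceptron_predict(np.array([1.0, -2.0, 0.5]), weights, bias=0.0))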

Multi-Layer Perceptrons (MLP) —

This neural network can have many layers of neurons and is able to learn more complex patterns.

Linear Layers —

Form the backbone of a multi-layer perceptron model, as they are responsible for performing linear transformations on the input data. These layers learn and apply weight and bias values to the input data, which are then passed through activation functions to generate non-linear outputs. The number and size of linear layers in a neural network determine the complexity and depth of the model, making them a critical component in building a high-performing network.
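As a minimal sketch of what such a layer computes (illustrative, not the assignment's exact interface), a linear layer stores a weight matrix and a bias vector and applies the affine map Z = A·Wᵀ + b to a batch of inputs A:

import numpy as np

class Linear:
    '''Minimal fully connected layer: Z = A @ W.T + b'''
    def __init__(self, in_features, out_features):
        rng = np.random.default_rng(0)
        self.W = rng.normal(0.0, 0.1, size=(out_features, in_features))  # learnable weights
        self.b = np.zeros(out_features)                                  # learnable biases

    def forward(self, A):
        # A has shape (batch_size, in_features); output has shape (batch_size, out_features)
        return A @ self.W.T + self.b

layer = Linear(in_features=4, out_features=2)
print(layer.forward(np.ones((3, 4))).shape)   # (3, 2)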

Activation Functions —

Introduces non-linearity into a neural network by mapping input values to an output range that is suitable for classification or regression tasks. Introducing non-linearity is important in neural networks because many real-world problems require non-linear relationships to be modeled accurately. Without non-linearity, the model would only be capable of learning linear patterns in the input data, which can limit its ability to capture complex relationships between variables. Common activation functions include Sigmoid, ReLU, and Tanh, which can be applied to different layers of the network depending on the specific task at hand.

  1. Sigmoid: Sigmoid is commonly used in binary classification tasks where the output is either 0 or 1. It is useful because it maps the output to a probability range of 0 to 1 and allows the model to learn a decision boundary to separate the two classes.

  2. Tanh (Hyperbolic Tangent): Tanh is similar to sigmoid but maps the output to the range -1 to 1. It is often used in recurrent neural networks (RNNs) due to its ability to capture short-term dependencies and is also used in image classification tasks.

  3. Softmax: Softmax is used in multi-class classification tasks to convert the output of the neural network into a probability distribution over several classes. The softmax function outputs probabilities that sum to 1, making it useful for predicting the probabilities of multiple classes.

  4. ReLU (Rectified Linear Unit): ReLU has a derivative of 1 for positive input values and 0 for negative input values. It is widely used in image classification tasks and has been shown to perform well in deep neural networks due to its computational efficiency and its ability to mitigate the vanishing gradient* problem by providing a non-zero gradient for positive input values.

  5. ELU (Exponential Linear Unit): ELU is an improved version of ReLU that has been shown to perform better in deep neural networks. It mitigates the dead neuron problem that can occur with ReLU by allowing negative values in the output, making it suitable for many deep learning tasks.

  6. GELU (Gaussian Error Linear Unit): GELU is a relatively new activation function that has been shown to perform well in deep neural networks. It is a smooth approximation of the rectified linear unit (ReLU) and has been shown to improve the accuracy of language models and other natural language processing (NLP) tasks. One advantage of GELU over ReLU is that it can produce small negative outputs for slightly negative inputs instead of zeroing them out, which can be useful in some NLP tasks.

*The vanishing gradient problem occurs when gradients become increasingly small as they are backpropagated through multiple layers in a neural network, making it difficult for the network to learn and update the weights in the earlier layers. This can occur when the derivative of the activation function becomes very small, as is the case with sigmoid and tanh functions, which can result in gradients that are close to zero.
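For reference, the activation functions above can be written in a few lines of NumPy (an illustrative sketch, not the graded implementation; the GELU uses the common tanh approximation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))                     # maps inputs to (0, 1)

def tanh(z):
    return np.tanh(z)                                   # maps inputs to (-1, 1)

def softmax(z):
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))   # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)            # rows sum to 1

def relu(z):
    return np.maximum(0.0, z)                           # zero for negative inputs

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))  # smooth, negative-valued left tail

def gelu(z):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z ** 3)))

z = np.array([-2.0, 0.0, 2.0])
print(relu(z), gelu(z))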

Forward Inference and Backpropagation —

Forward inference involves propagating input data through the neural network, applying linear transformations and activation functions to generate output values that can be used for classification or regression. This process proceeds layer by layer, with each layer in the network contributing to the final output value.

Backpropagation is the process of computing the gradients of the loss function with respect to the weights and biases of the network. These gradients are used to update the parameters during optimization, allowing the network to learn from the training data and improve its performance over time.

Image by Carolina Bento on Medium
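To make the two passes concrete, here is a toy sketch (my own illustrative code, not the assignment's interface) of one forward and one backward pass through a single linear layer with an MSE loss; the gradients follow directly from the chain rule:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))             # batch of 8 inputs with 3 features
Y = rng.normal(size=(8, 1))             # regression targets
W = rng.normal(0.0, 0.1, size=(1, 3))   # layer weights
b = np.zeros((1, 1))                    # layer bias

# Forward pass: linear transformation, then the MSE loss
Z = A @ W.T + b                         # predictions, shape (8, 1)
loss = np.mean((Z - Y) ** 2)

# Backward pass: chain rule gives the gradients of the loss w.r.t. the parameters
dZ = 2 * (Z - Y) / Z.size               # dL/dZ
dW = dZ.T @ A                           # dL/dW, shape (1, 3)
db = dZ.sum(axis=0, keepdims=True)      # dL/db, shape (1, 1)
print(round(loss, 4), dW, db)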

Criterion Functions: MSE and CELoss —

Criterion (loss) functions evaluate the performance of the neural network by comparing the predicted output values to the ground-truth values.

Mean Squared Error (MSE) loss is commonly used in regression tasks, where the goal is to predict a continuous numerical value. It measures the average squared difference between the predicted and actual values, and a lower MSE indicates a better-performing model.

\begin{equation}MSE = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{y_i})^2\end{equation}

In this equation, $y_i$ represents the true value of the i-th example in the dataset, $\hat{y_i}$ represents the predicted value for that example, and n is the total number of examples in the dataset. The MSE is calculated by taking the average of the squared differences between the true values and the predicted values.
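This formula is one line of NumPy (an illustrative sketch):

import numpy as np

def mse_loss(y_true, y_pred):
    '''Mean squared error: average of the squared differences.'''
    return np.mean((y_true - y_pred) ** 2)

print(mse_loss(np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 2.0])))  # 0.4167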

Cross Entropy loss is used in classification tasks where the goal is to assign each input data point to one of several predefined classes. CE loss measures the difference between the predicted probability distribution and the true distribution, and a lower cross-entropy loss indicates a better-performing model.

\begin{equation}H(p, q) = -\sum\limits_{x}p(x)\log q(x)\end{equation}

In this formula, $p$ represents the true probability distribution of the data, and $q$ represents the predicted probability distribution from the model. The cross-entropy loss measures the difference between the two distributions, with a lower loss indicating that the predicted distribution is closer to the true distribution.
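A minimal NumPy sketch of this formula for a one-hot true distribution and a softmax-style predicted distribution (illustrative):

import numpy as np

def cross_entropy_loss(p, q, eps=1e-12):
    '''Cross-entropy between true distribution p and predicted distribution q.'''
    return -np.sum(p * np.log(q + eps))   # eps avoids log(0)

p = np.array([0.0, 1.0, 0.0])             # one-hot true label (class 1)
q = np.array([0.1, 0.7, 0.2])             # predicted probabilities
print(cross_entropy_loss(p, q))           # -log(0.7) ≈ 0.357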

Optimization: Stochastic Gradient Descent (SGD) —

The process of adjusting the weights and biases of the neural network to minimize the loss function. SGD is a popular optimization algorithm that updates the weights and biases based on the gradient of the loss function with respect to the parameters.

\begin{equation}\theta_{t+1} = \theta_{t} - \alpha\nabla f_{i}(\theta_{t})\end{equation}

In this equation, $\theta_{t+1}$ is the updated value of the model parameters at time $t+1$, $\theta_t$ is the current value of the model parameters at time $t$, $\alpha$ is the learning rate, $\nabla f_i(\theta_t)$ is the gradient of the loss function $f_i$ with respect to the model parameters evaluated at the current value $\theta_t$, and $i$ is the index of the current training example in the dataset.
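The update rule itself is a single line; here is an illustrative sketch:

import numpy as np

def sgd_step(theta, grad, lr=0.01):
    '''One SGD update: theta_{t+1} = theta_t - lr * grad.'''
    return theta - lr * grad

theta = np.array([0.5, -0.3])
grad = np.array([0.2, -0.1])
print(sgd_step(theta, grad, lr=0.1))      # [0.48, -0.29]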

Regularization: Batch Normalization —

Helps to stabilize the training process and improve the performance of the neural network. It normalizes the inputs to each layer of the network, reducing the effect of internal covariate shift and improving the overall robustness of the model. It also helps prevent overfitting and improves generalization to new data.
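Here is a sketch of the training-time batch norm forward computation (it omits the running mean and variance used at inference time; the names are illustrative):

import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-5):
    '''Normalize each feature over the batch, then apply a learnable scale and shift.'''
    mu = Z.mean(axis=0)                        # per-feature batch mean
    var = Z.var(axis=0)                        # per-feature batch variance
    Z_hat = (Z - mu) / np.sqrt(var + eps)      # normalized activations
    return gamma * Z_hat + beta                # learnable scale (gamma) and shift (beta)

Z = np.random.default_rng(0).normal(size=(8, 4))
out = batchnorm_forward(Z, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(2))   # ~0 mean, ~1 std per feature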

Example: MLP for Sentiment Analysis

Here, I will go over a simple demonstration of the use of perceptrons (a single neuron) and multi-layer perceptrons (multiple layers of neurons) on a simple sentiment analysis task.

Reference: Multilayer Perceptron Explained with a Real-Life Example and Python Code: Sentiment Analysis

Task—

Guests of a hotel leave a short note about their stay. Classify the reviews as positive or negative and report the accuracy score.

Steps —

  • split dataset into train and test

  • turn the short notes (corpus) into a tf-idf array: this provides a way to represent documents as vectors of features that capture the importance of each term in the document relative to its frequency in the corpus

  • fit/transform the vectorizer for the train and test splits

  • build perceptron and evaluate accuracy

  • build MLP and evaluate accuracy

    • try 3-5 neurons and 3 layers

    • ReLU activation, SGD optimizer, inverse scaling learning rate


import numpy as np
from sklearn import metrics
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier


def buildPerceptron(x_train, x_test, y_train, y_test):
    '''
    Build a Perceptron and fit the data
    '''
    classifier = Perceptron(random_state=457)
    classifier.fit(x_train, y_train)

    predictions = classifier.predict(x_test)
    score = np.round(metrics.accuracy_score(y_test, predictions), 2)
    print("Mean accuracy of predictions: " + str(score))

def buildMLPerceptron(x_train, x_test, y_train, y_test, num_neurons=5): # more neurons can help capture more complex patterns
    '''
    Build a Multi-Layer Perceptron and fit the data
    Activation: ReLU
    Optimizer: SGD
    Learning Rate: Inverse Scaling
    '''
    classifier = MLPClassifier(hidden_layer_sizes=num_neurons, max_iter=35, 
                               activation='relu', solver='sgd', verbose=10, 
                               random_state=762, learning_rate='invscaling')
    classifier.fit(x_train, y_train)

    predictions = classifier.predict(x_test)
    score = np.round(metrics.accuracy_score(y_test, predictions), 2)
    print("Mean accuracy of predictions: " + str(score))

Insights —

As the accuracy scores show, MLPs tend to be more accurate than perceptrons because they can learn non-linear relationships between input and output data. Perceptrons are limited to learning linearly separable patterns, which can be a significant limitation in some cases.

MLPs are more complex than perceptrons because they consist of multiple layers of interconnected neurons, whereas perceptrons have a single layer of neurons. They are capable of learning complex patterns and relationships in data, whereas perceptrons can only learn linearly separable patterns.

MLPs are trained using backpropagation, which is an efficient algorithm for adjusting the weights of the network to minimize the difference between the predicted output and the actual output. Perceptrons are trained using the perceptron learning rule, which adjusts the weights of the network based on the error in the output. However, this rule only works for linearly separable patterns.
