Face Classification and Verification using CNNs
CMU 11-785: Introduction to Deep Learning
Task
Face classifier
Build a classifier that can extract feature vectors from face images. It learns facial features (e.g., skin tone, nose size, hair color) and represents them in a fixed-length feature vector called a face embedding. After several convolutional layers, the embedding is passed through a linear layer followed by a Softmax to classify it among N identities. The feature vectors obtained here are reused in the verification task.
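A minimal sketch of this pipeline, assuming a generic CNN backbone (the layer sizes and names below are illustrative, not the exact architecture used):

```python
import torch
import torch.nn as nn

class FaceClassifier(nn.Module):
    """Toy example: CNN backbone -> fixed-length embedding -> linear classifier."""
    def __init__(self, num_classes: int = 7000, embed_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(             # stand-in for ResNet/ConvNeXt stages
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),               # -> (B, embed_dim, 1, 1)
            nn.Flatten(),                          # -> (B, embed_dim) face embedding
        )
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x, return_embedding: bool = False):
        feats = self.backbone(x)                   # embedding reused for verification
        if return_embedding:
            return feats
        return self.classifier(feats)              # logits; Softmax is applied inside the loss
```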
Verification system
Build a system that computes the similarity between the feature vectors of two images: given a pair of face images, it outputs a similarity score and a decision on whether they show the same person. In this assignment the comparison is actually one-to-many rather than a single pairwise check: each unknown identity is mapped either to one of the known identities or to n000000 (the “no correspondence” label).
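A sketch of the one-to-many matching step, assuming embeddings have already been extracted with the classifier backbone above (the threshold value is illustrative):

```python
import torch
import torch.nn.functional as F

def match_identities(unknown_embs, known_embs, known_ids, threshold=0.5):
    """unknown_embs: (U, D), known_embs: (K, D), known_ids: list of K identity strings."""
    unknown_embs = F.normalize(unknown_embs, dim=1)
    known_embs = F.normalize(known_embs, dim=1)
    sims = unknown_embs @ known_embs.T             # (U, K) cosine similarities
    best_sim, best_idx = sims.max(dim=1)
    preds = []
    for sim, idx in zip(best_sim.tolist(), best_idx.tolist()):
        # below the threshold, declare "no correspondence"
        preds.append(known_ids[idx] if sim >= threshold else "n000000")
    return preds
```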
Evaluation
Face classification accuracy: ratio of the number of correctly classified images to the total number of images
Face verification accuracy: ratio of the number of correctly matched unknown identities to the total number of unknown identities
Key Concepts
Face classification
Classifying an input face image into one of several predefined classes such as male or female, young or old, happy or sad, etc. The input image is processed through multiple layers of convolutional and fully connected layers to obtain a probability distribution over the predefined classes.
Face verification
Determine whether two face images are of the same person, without necessarily knowing who the person is. This is often used in biometric systems for identity verification, where a user's identity is verified by comparing their face image with a stored reference image.
Multi-class face classification
The input is a person’s face image and the model predicts which class (out of N total classes) the image belongs to.
Position invariance
Property of deep learning models that allows them to recognize an object regardless of its position in the input image. In other words, the model can identify an object even if it appears in different locations within the image. This property is essential in many computer vision tasks such as object detection, where objects can appear in various locations and scales within an image. Convolutional neural networks (CNNs) are designed to be position invariant by using convolutional layers that apply filters across the entire input image, capturing features regardless of their position.
CNN-based architectures
ResNet, short for Residual Network, is a deep convolutional neural network architecture that is widely used in image classification and object recognition tasks. It introduced residual connections, which allow for training of very deep networks without vanishing gradients. ResNet achieves state-of-the-art performance on various image classification benchmarks and is often used as a starting point for transfer learning.
MobileNet is a lightweight and efficient neural network architecture designed for mobile and embedded vision applications. It uses depth-wise separable convolutions to significantly reduce the number of parameters and computation while maintaining high accuracy. MobileNet is known for its fast inference speed and low memory requirements, making it suitable for resource-constrained devices.
ConvNeXt is a recent CNN architecture that modernizes the standard ResNet design with ideas borrowed from the Swin Transformer: inverted bottleneck blocks, residual connections, and depthwise convolution, a special case of grouped convolution in which the number of groups equals the number of channels.
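As an illustration of why depthwise convolution is cheap (relevant to both MobileNet and ConvNeXt), a quick parameter-count comparison, assuming 128 input/output channels and a 3x3 kernel chosen purely for the example:

```python
import torch.nn as nn

channels = 128
standard  = nn.Conv2d(channels, channels, kernel_size=3, padding=1)                     # dense conv
depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)    # one filter per channel

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))   # 128 * 128 * 3 * 3 + 128 = 147,584
print(count(depthwise))  # 128 *   1 * 3 * 3 + 128 =   1,280
```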
Dataset
For this project, I used a subset of the VGGFace2 dataset, class-balanced, and resized to 224x224 pixels. VGGFace2 is a large-scale face recognition dataset. Images are downloaded from Google Image Search and have large variations in pose, age, illumination, ethnicity and profession.
Classification: 7000 identities
Train: images + ID
Dev: validate the classification accuracy
Test: assign IDs for the images here and submit the result
Verification: 1080 unknown identities
Train: images + ID
Known: all known identities
Unknown_dev: 360 images for which the ground truth mapping is given
Unknown_test: 720 images
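A minimal sketch of how the classification split could be loaded, assuming the usual ImageFolder layout (one subdirectory per identity); the paths and the placeholder transform are assumptions, not the actual setup:

```python
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

transform = T.Compose([T.Resize(224), T.ToTensor()])   # minimal placeholder transform

# hypothetical directory layout: one subfolder per identity under train/ and dev/
train_dataset = torchvision.datasets.ImageFolder("data/classification/train", transform=transform)
dev_dataset   = torchvision.datasets.ImageFolder("data/classification/dev", transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=4, pin_memory=True)
dev_loader   = DataLoader(dev_dataset, batch_size=64, shuffle=False, num_workers=4, pin_memory=True)
```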
Data Transforms (torchvision)
RandomHorizontalFlip: randomly flips an image horizontally, useful for data augmentation and ensuring the model is robust to image orientation variations.
RandomPerspective: applies a random perspective transformation to the image with a given probability, useful for data augmentation and making the model robust to viewpoint changes.
ColorJitter: randomly adjust the brightness, contrast, saturation and hue of an image, useful for data augmentation and improving model robustness to variations in color.
RandAugment: applies a sequence of randomly selected image transformations, such as rotation, scaling, and color distortion, to each training image.
ToTensor: converts an image to a tensor, useful for converting image data into a format that can be input to the model.
RandomErasing: randomly erases a rectangular portion of the image and replaces it with random noise, which encourages the network to be more invariant to occlusion and improves its ability to recognize partially occluded objects.
RandomCrop: randomly crops an image to a given size, useful for data augmentation and ensuring the model is robust to input size variations.
Resize: resizes an image to a given size, useful for pre-processing data to a fixed size for input to the model.
Normalize: normalizes an image with given mean and standard deviation values, useful for pre-processing data to improve model convergence and accuracy.
RandomRotation: randomly rotates an image by a given angle, useful for data augmentation and ensuring the model is robust to image rotation variations.
RandomResizedCrop: randomly crops and resizes an image to a given size, useful for data augmentation and ensuring the model is robust to input size variations.
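A possible training pipeline built from a subset of these transforms (the specific choices and parameter values below are illustrative, not the exact augmentation recipe used):

```python
import torchvision.transforms as T

train_transforms = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),     # random crop + resize to the model's input size
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # ImageNet statistics
    T.RandomErasing(p=0.25),                        # operates on tensors, so it comes after ToTensor
])

val_transforms = T.Compose([
    T.Resize(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```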
Distance Metrics: Cosine Similarity vs Euclidean Distance
Cosine Similarity measures the cosine of the angle between two vectors. For verification, the similarity between a given face embedding and each of the known faces is computed, and the face is assigned to the identity with the highest similarity score. Because it is insensitive to the magnitude of the vectors, it is often better suited to high-dimensional feature spaces.
Euclidean Distance measures the straight-line distance between two points in a multi-dimensional space. Here it measures the distance between the feature vectors of two face images; if the distance falls below a chosen threshold, the images are considered a match.
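Both metrics are available in PyTorch; a small sketch comparing them on two embeddings (the 0.5 threshold is illustrative and would be tuned on the dev set):

```python
import torch
import torch.nn.functional as F

a = torch.randn(512)   # embedding of image A
b = torch.randn(512)   # embedding of image B

cosine_sim = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()  # in [-1, 1]; higher = more similar
euclidean  = torch.dist(a, b, p=2).item()                                # >= 0; lower = more similar

same_person = cosine_sim >= 0.5
```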
Model and Hyperparameters
Epochs: 10-350
Batch size: 64-124
Initial learning rate: 0.1
Transformations: choose from the data transforms above for the training dataset
Dropout: stochastic depth
Regularization: data transformations, stochastic depth, label smoothing
Criterion: cross entropy loss
Label smoothing
Optimizer: SGD
Weight decay
Momentum
Scheduler: ReduceLROnPlateau, StepLR, MultiStepLR, CosineAnnealingLR
Mixed Precision: recommended when training on a GPU that supports it (Tesla T4, V100, etc.)
ConvNeXtBlock: Conv2d-BatchNorm2d-Conv2d-GELU-Conv2d-StochasticDepth (see the sketch after this list)
Model: ConvNeXt-T from the “A ConvNet for the 2020s” paper, with stage depths (3, 3, 9, 3); depthwise convolution is implemented by setting groups=channels in ConvNeXtBlock()
My final model was 33.19M total parameters
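A sketch of the block described above, assuming the Conv2d-BatchNorm2d-Conv2d-GELU-Conv2d-StochasticDepth ordering with groups=channels for the first (depthwise) convolution; the kernel size and 4x expansion follow the ConvNeXt paper but are assumptions about this particular implementation:

```python
import torch
import torch.nn as nn
from torchvision.ops import StochasticDepth

class ConvNeXtBlock(nn.Module):
    """Depthwise conv -> norm -> 1x1 expand -> GELU -> 1x1 project, with a stochastic-depth residual."""
    def __init__(self, dim: int, drop_path: float = 0.1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # depthwise: groups == channels
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # inverted bottleneck expansion
            nn.GELU(),
            nn.Conv2d(4 * dim, dim, kernel_size=1),                     # project back to dim
        )
        self.drop_path = StochasticDepth(drop_path, mode="row")         # randomly drops the residual branch

    def forward(self, x):
        return x + self.drop_path(self.block(x))
```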
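And a sketch of the training setup implied by the hyperparameter list (label smoothing, SGD with momentum and weight decay, ReduceLROnPlateau, mixed precision); the label smoothing, weight decay, momentum, and patience values are placeholders, not the ones actually used:

```python
import torch

model = FaceClassifier().cuda()                      # from the sketch in the Task section
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=2)
scaler = torch.cuda.amp.GradScaler()                 # mixed-precision loss scaling

for images, labels in train_loader:                  # train_loader from the dataset sketch above
    images, labels = images.cuda(), labels.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                  # run the forward pass in mixed precision
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# after each epoch, step the scheduler on the validation loss:
# scheduler.step(val_loss)
```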
Submission Cutoffs
Classification: high (90%), medium (86%), low (82%)
I achieved 92.39% accuracy 🥳
Verification: high (63%), medium (60%), low (50%)
I achieved 64.58% accuracy 🌟
Reflection
While training the face classifier, I faced the issue of the learning rate dropping too quickly, which hurt the model's performance. To overcome this, I switched to the ReduceLROnPlateau scheduler, which dynamically adjusted the learning rate based on the validation loss. Additionally, I found that resuming training from checkpoints improved the final accuracy. Another key takeaway from this project was the importance of data augmentation for face classification, as it significantly improved the model's performance on the test set.