Face Classification and Verification using CNNs

CMU 11-785 Course: Deep Learning

Task

  • Face classifier

    • Build one that extracts feature vectors from face images. The network learns facial features (e.g. skin tone, nose size, hair color) and represents them in a fixed-length feature vector called a face embedding. After several convolutional layers, the embedding is passed through a linear layer followed by a Softmax to classify it among N identities. The embeddings obtained here are reused in the verification task (see the sketch after this list)

  • Verification system

    • Build a system that computes the similarity between the feature vectors of two images and outputs a similarity score, i.e. whether they show the same person. Here it is a one-to-many comparison rather than a single pairwise one: each unknown identity is mapped to one of the known identities or to n000000 (the “no correspondence” label)
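
A minimal sketch of how the two tasks share the backbone; the stand-in backbone, the embedding size, and all names below are illustrative assumptions, not the actual model:

```python
import torch
import torch.nn as nn

EMBED_DIM, NUM_IDS = 512, 7000  # assumed embedding size; 7000 identities

backbone = nn.Sequential(                  # stand-in for the real CNN backbone
    nn.Conv2d(3, 64, 7, stride=2, padding=3),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, EMBED_DIM),
)
classifier = nn.Linear(EMBED_DIM, NUM_IDS)  # only used for the classification task

x = torch.randn(4, 3, 224, 224)
embedding = backbone(x)                     # reused later for verification
logits = classifier(embedding)              # Softmax over the N identities
probs = logits.softmax(dim=1)
```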

Evaluation

  • Face classification accuracy: ratio of the number of correctly classified images to the total number of images

  • Face verification accuracy: ratio of the number of correctly matched unknown identities to the total number of unknown identities

Key Concepts

Face classification

  • Classifying an input face image into one of several predefined classes, such as male or female, young or old, happy or sad. The input image is processed through multiple convolutional and fully connected layers to obtain a probability distribution over the predefined classes.

Face verification

  • Determine whether two face images are of the same person, without necessarily knowing who the person is. This is often used in biometric systems for identity verification, where a user's identity is verified by comparing their face image with a stored reference image. 

Multi-class face classification

  • The input is an image of a person’s face, and the model predicts which of the N total classes the image belongs to

Position invariance

  • Property of deep learning models that allows them to recognize an object regardless of its position in the input image. This is essential in many computer vision tasks such as object detection, where objects can appear at various locations and scales. Convolutional layers are translation equivariant: the same filters are applied across the entire input image, so a feature is detected wherever it appears and its response simply shifts with the input. Pooling and global aggregation layers then turn this equivariance into (approximate) position invariance, as demonstrated below.
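
A small self-contained demonstration on a toy input: the convolution's response shifts by exactly the same offset as the input feature.

```python
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 8, 8)
x[0, 0, 2, 2] = 1.0                         # a single "feature" at (2, 2)
kernel = torch.ones(1, 1, 3, 3)

y = F.conv2d(x, kernel, padding=1)
x_shifted = torch.roll(x, shifts=(3, 3), dims=(2, 3))   # move feature to (5, 5)
y_shifted = F.conv2d(x_shifted, kernel, padding=1)

# The output feature map moved by exactly the same offset as the input
assert torch.allclose(torch.roll(y, shifts=(3, 3), dims=(2, 3)), y_shifted)
```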

CNN-based architectures

  • ResNet, short for Residual Network, is a deep convolutional neural network architecture that is widely used in image classification and object recognition tasks. It introduced residual connections, which allow for training of very deep networks without vanishing gradients. ResNet achieves state-of-the-art performance on various image classification benchmarks and is often used as a starting point for transfer learning.

  • MobileNet is a lightweight and efficient neural network architecture designed for mobile and embedded vision applications. It uses depth-wise separable convolutions to significantly reduce the number of parameters and computation while maintaining high accuracy. MobileNet is known for its fast inference speed and low memory requirements, making it suitable for resource-constrained devices.

  • ConvNeXt is a recent CNN architecture that uses inverted bottlenecks inspired by the Swin Transformer, residual blocks, and depthwise convolution, a special case of grouped convolution where the number of groups equals the number of channels.
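
Both MobileNet's depthwise separable convolution and ConvNeXt's depthwise convolution come down to grouped convolution with groups equal to the number of channels. A minimal sketch, with illustrative shapes and channel counts, showing the parameter savings:

```python
import torch
import torch.nn as nn

channels, out_channels = 64, 128
depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                      groups=channels)                    # one 3x3 filter per channel
pointwise = nn.Conv2d(channels, out_channels, kernel_size=1)  # 1x1 conv mixes channels

x = torch.randn(1, channels, 56, 56)
y = pointwise(depthwise(x))

# Parameter comparison against a standard 3x3 convolution with the same shapes
standard = nn.Conv2d(channels, out_channels, kernel_size=3, padding=1)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(depthwise) + count(pointwise), "vs", count(standard))
# ~9k vs ~74k parameters for the same input/output shapes
```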

Dataset

  • For this project, I used a subset of the VGGFace2 dataset, class-balanced, and resized to 224x224 pixels. VGGFace2 is a large-scale face recognition dataset. Images are downloaded from Google Image Search and have large variations in pose, age, illumination, ethnicity and profession.

  • Classification: 7000 identities

    • Train: images + ID

    • Dev: validate the classification accuracy

    • Test: assign IDs for the images here and submit the result

  • Verification: 1080 unknown identities 

    • Train: images + ID

    • Known: all known identities

    • Unknown_dev: 360 images for which the ground-truth mapping is given

    • Unknown_test: 720 images

Data Transforms (torchvision)

  • RandomHorizontalFlip: randomly flips an image horizontally, useful for data augmentation and ensuring the model is robust to image orientation variations.

  • RandomPerspective: performs a random perspective transformation of the image with a given probability, useful for making the model robust to viewpoint changes.

  • ColorJitter: randomly adjust the brightness, contrast, saturation and hue of an image, useful for data augmentation and improving model robustness to variations in color.

  • RandAugment: applies a sequence of randomly selected image transformations, such as rotation, scaling, and color distortion, to each training image.

  • ToTensor: converts an image to a tensor, useful for converting image data into a format that can be input to the model.

  • RandomErasing: randomly erases a rectangular portion of the image and replaces it with random noise, which encourages the network to be more invariant to occlusion and improves its ability to recognize partially occluded objects.

  • RandomCrop: randomly crops an image to a given size, useful for data augmentation and ensuring the model is robust to input size variations.

  • Resize: resizes an image to a given size, useful for pre-processing data to a fixed size for input to the model.

  • Normalize: normalizes an image with given mean and standard deviation values, useful for pre-processing data to improve model convergence and accuracy.

  • RandomRotation: randomly rotates an image by a given angle, useful for data augmentation and ensuring the model is robust to image rotation variations.

  • RandomResizedCrop: randomly crops and resizes an image to a given size, useful for data augmentation and ensuring the model is robust to input size variations.

Figure: a subset of my train dataset after applying some of the transforms above.
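
As a concrete reference, a minimal sketch of a training pipeline built from a few of these transforms; the parameter values and the ImageNet normalization statistics below are assumptions, not my exact configuration:

```python
import torchvision.transforms as T

train_transforms = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    T.RandomPerspective(distortion_scale=0.2, p=0.3),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics (assumed)
                std=[0.229, 0.224, 0.225]),
    T.RandomErasing(p=0.25),                  # operates on tensors, so after ToTensor
])
```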

Distance Metrics: Cosine Similarity vs Euclidean Distance

  • Cosine Similarity measures the cosine of the angle between two vectors. For verification, the embedding of a given face image is scored against the embeddings of the known faces, and the face is matched to the identity with the highest similarity score. Because it is insensitive to vector magnitude, it is well suited to high-dimensional feature spaces.

  • Euclidean Distance measures the straight-line distance between two points in a multi-dimensional space. Applied to the feature vectors of two face images, the pair is considered a match if the distance falls below a chosen threshold.
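
A minimal sketch of how either metric can drive the one-to-many match, assuming 512-dimensional embeddings and an illustrative threshold (in practice tuned on Unknown_dev):

```python
import torch
import torch.nn.functional as F

unknown = F.normalize(torch.randn(1, 512), dim=1)      # query embedding
known = F.normalize(torch.randn(1080, 512), dim=1)     # gallery of known identities

cos_sim = unknown @ known.T                 # (1, 1080) cosine similarities
best_score, best_idx = cos_sim.max(dim=1)   # most similar known identity

euclid = torch.cdist(unknown, known)        # (1, 1080) Euclidean distances

THRESHOLD = 0.5                             # assumed; tune on Unknown_dev
match = best_idx.item() if best_score.item() > THRESHOLD else "n000000"
```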

Model and Hyperparameters

  • Epoch: 10-350

  • Batch size: 64-124

  • Initial learning rate: 0.1

  • Transformations: choose from the data transforms above for the training dataset

  • Dropout: stochastic depth

  • Regularization: data transformations, stochastic depth, label smoothing

  • Criterion: cross entropy loss

    • Label smoothing

  • Optimizer: SGD

    • Weight decay

    • Momentum

  • Scheduler: ReduceLROnPlateau, StepLR, MultiStepLR, CosineAnnealingLR

  • Mixed precision: worthwhile when training on a GPU with Tensor Core support (e.g. Tesla T4, V100)

  • ConvNeXtBlock: Conv2d-BatchNorm2d-Conv2d-GELU-Conv2d-StochasticDepth

  • Model: ConvNeXt-T from the “A ConvNet for the 2020s” paper, with a (3, 3, 9, 3) block configuration; added groups=channels in ConvNeXtBlock() to make the first convolution depthwise (see the sketch after this list)

    • My final model was 33.19M total parameters
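
A minimal sketch of the ConvNeXtBlock sequence listed above, keeping the stated layer order (depthwise Conv2d, BatchNorm2d, 1x1 Conv2d, GELU, 1x1 Conv2d, StochasticDepth); the 7x7 kernel and 4x inverted bottleneck follow the ConvNeXt paper, and the stochastic-depth probability is an assumption:

```python
import torch
import torch.nn as nn
from torchvision.ops import StochasticDepth

class ConvNeXtBlock(nn.Module):
    def __init__(self, channels: int, sd_prob: float = 0.1):
        super().__init__()
        self.block = nn.Sequential(
            # depthwise conv: groups equals the number of channels
            nn.Conv2d(channels, channels, kernel_size=7, padding=3, groups=channels),
            nn.BatchNorm2d(channels),
            # inverted bottleneck: expand 4x, GELU, project back down
            nn.Conv2d(channels, 4 * channels, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(4 * channels, channels, kernel_size=1),
        )
        # randomly drops the residual branch during training
        self.stochastic_depth = StochasticDepth(sd_prob, mode="row")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.stochastic_depth(self.block(x))

y = ConvNeXtBlock(96)(torch.randn(2, 96, 56, 56))  # shape is preserved
```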

Submission Cutoffs

  • Classification: high (90%), medium (86%), low (82%)

    • I achieved 92.39% accuracy 🥳

  • Verification: high (63%), medium (60%), low (50%)

    • I achieved 64.58% accuracy 🌟

Reflection

While training the face classifier, I faced the issue of the learning rate dropping too quickly, which hurt the model's performance. To overcome this, I switched to ReduceLROnPlateau, which adjusts the learning rate dynamically based on the validation loss. I also found that resuming training from checkpoints improved the final accuracy (a minimal sketch of this pattern follows). Another key takeaway from this project was the importance of data augmentation for face classification, as it significantly improved the model's performance on the test set.
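
A minimal, self-contained sketch of the checkpoint-and-resume pattern with ReduceLROnPlateau; the tiny model, stand-in validation loss, and all hyperparameters below are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                    # placeholder for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)

for epoch in range(5):
    # ... training step would go here ...
    val_loss = 1.0 / (epoch + 1)            # stand-in for real validation loss
    scheduler.step(val_loss)                # LR drops only when the loss plateaus
    torch.save({"epoch": epoch,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "scheduler": scheduler.state_dict()}, "checkpoint.pth")

# Resuming restores all three states so training continues where it left off
ckpt = torch.load("checkpoint.pth")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["scheduler"])
```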
