Face Recognition (ArcFace)

CW Lin
Sep 21, 2019

Face recognition consists of two parts: face detection and face classification.
Given a picture, we first need to find out whether there are any faces in it and where they are located. Then we classify whose face each detection looks like.

1. Face detection

https://becominghuman.ai/face-detection-models-and-softwares-42b562a8e151

As far as I know, there are two major approaches to this problem: Haar cascade and MTCNN.
The Haar cascade detector combines Haar features, AdaBoost, and a cascade classifier.
MTCNN is based on a neural network and performs both face detection and face alignment.

In my experience, MTCNN is slightly slower than the Haar cascade but has higher accuracy. The Haar cascade has acceptable recall but poor precision (many false alarms).

With the faces located in the image by face detection, we then classify them to achieve face recognition.
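As a rough sketch (not from the original post), the snippet below runs both detectors on one image. It assumes the opencv-python and third-party `mtcnn` packages; the file name is just a placeholder.

```python
# Minimal face-detection sketch: Haar cascade vs. MTCNN.
# Assumes `pip install opencv-python mtcnn`; "group_photo.jpg" is hypothetical.
import cv2
from mtcnn import MTCNN

img = cv2.imread("group_photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Haar cascade: fast, acceptable recall, but prone to false alarms.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
haar_boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# MTCNN: slower, but more accurate, and its landmarks allow face alignment.
detector = MTCNN()
mtcnn_faces = detector.detect_faces(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))

print(f"Haar cascade found {len(haar_boxes)} faces")
print(f"MTCNN found {len(mtcnn_faces)} faces")  # each entry has 'box', 'confidence', 'keypoints'
```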

2. Face classification

https://github.com/deepinsight/insightface

There are many methods for this part. Basically, face classification comprises two steps: feature extraction and classification.
If you use a neural network, these two steps are combined into a single model.

Feature extraction methods such as PCA, 2DPCA, etc. transform a face image into a feature vector, so that we can apply a classification algorithm such as KNN, SVM, or a tree-based method to classify it.
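For a concrete (if dated) baseline, here is a sketch of that classical pipeline using scikit-learn's LFW loader, PCA, and an SVM; the hyperparameters are illustrative, not tuned, and none of this comes from the original post.

```python
# Classical face classification baseline: PCA features + SVM classifier.
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Downloads the LFW dataset on first use.
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X_train, X_test, y_train, y_test = train_test_split(
    faces.data, faces.target, test_size=0.25, random_state=0)

# PCA turns raw pixels into a compact feature vector ("eigenfaces"),
# then a standard classifier separates the identities.
model = make_pipeline(PCA(n_components=150, whiten=True, random_state=0),
                      SVC(kernel="rbf", C=10, gamma="scale"))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```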

Recently, powerful neural network architectures have sprung up like mushrooms, pushing accuracy to a level even higher than human performance. Therefore, people mostly use deep learning for this task (face recognition).

ArcFace

Here I want to share the algorithm from ArcFace: Additive Angular Margin Loss for Deep Face Recognition (Feb 2019). This method reaches state-of-the-art results on the IJB-B, IJB-C, AgeDB, LFW, and MegaFace datasets.

https://paperswithcode.com/sota/face-identification-on-megaface

The spirit of ArcFace: train under a stricter condition (the ArcFace loss) to get better performance at test time. Let's look at some details.

A normal CNN structure looks like the figure below:

ArcFace adds an angular margin between the FC embedding and the logits.

Let's start from the regular softmax output combined with the cross-entropy loss.
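In the notation of the ArcFace paper ($N$ samples, $n$ classes, embedding $x_i$ with label $y_i$, FC weights $W_j$ and biases $b_j$):

$$
L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_{j}^{T}x_i + b_{j}}}
$$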

We can rewrite the inner product as:
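$$
W_j^{T}x_i = \lVert W_j\rVert\,\lVert x_i\rVert\cos\theta_j
$$

where $\theta_j$ is the angle between the weight $W_j$ and the feature $x_i$.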

Fix each individual weight $W_j$ to norm 1 by L2 normalization, rescale the embedding feature $x_i$ so its norm is $s$, and separate the denominator into the ground-truth class part and the rest. The loss function becomes:
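$$
L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}} + \sum_{j\neq y_i} e^{s\cos\theta_j}}
$$

(Here the bias $b_j$ is set to $0$, as in the paper.)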

Finally, we add the margin $m$ to the ground-truth class term:
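$$
L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j\neq y_i} e^{s\cos\theta_j}}
$$

This is the ArcFace (additive angular margin) loss.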

The original softmax assigns the embedding feature $x_i$ to the class $j$ whose weight $W_j$ is closest to it (smallest angle).

With the ArcFace loss (angular margin $m$), however, the ground-truth class weight must not only be the closest weight to the embedding feature $x_i$; it must still be the closest after the angle between them is enlarged by $m$.
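A minimal numpy sketch of that idea: normalize features and weights so the logits become cosines, then add the margin only to the ground-truth angle before scaling by $s$. The values s=64 and m=0.5 follow the paper's typical setting; the function names are mine, not from any library.

```python
# Sketch of the ArcFace logit modification for a batch of embeddings.
import numpy as np

def arcface_logits(x, W, labels, s=64.0, m=0.5):
    # L2-normalize embeddings (N, d) and class weights (d, n_classes)
    # so that their dot product is cos(theta).
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    W = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos = np.clip(x @ W, -1.0, 1.0)                     # (N, n_classes)

    # Add the angular margin m only to the ground-truth class angle.
    idx = np.arange(len(labels))
    theta_gt = np.arccos(cos[idx, labels])
    cos_margin = cos.copy()
    cos_margin[idx, labels] = np.cos(theta_gt + m)
    return s * cos_margin                               # feed into softmax cross-entropy

def softmax_cross_entropy(logits, labels):
    # Standard numerically-stable softmax cross-entropy.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()
```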

https://arxiv.org/pdf/1801.07698.pdf

It reminds me of the hinge loss in SVM. Hinge loss adds a margin to the linear output, whereas ArcFace adds the margin to the angle inside the inner product of the softmax.

Hinge loss: it is not enough to predict correctly, you need to predict correctly by a margin, or you get a penalty!
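For reference, for a label $y \in \{-1, +1\}$ and linear score $f(x)$, the standard hinge loss is:

$$
L_{\text{hinge}} = \max\bigl(0,\; 1 - y\,f(x)\bigr)
$$

A prediction only escapes penalty once its margin exceeds 1, which mirrors how ArcFace demands the correct class still win after the angle is enlarged by $m$.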

The CNN backbones in the paper are ResNet100 and ResNet50, with a 512-dimensional feature in the FC layer. The authors have also published their models in the model zoo of their GitHub repo (MXNet models), so you don't actually need to train anything yourself!

With a trained model, we only need to embed each face through the FC layer into what I call the face space, which describes facial features very well. The closer two faces are in the face space (by cosine distance), the more similar the people look.
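A tiny sketch of that comparison, assuming two 512-d embeddings already extracted by the model; the 0.5 threshold is just an illustrative placeholder, not a value from the post.

```python
# Compare two face embeddings by cosine similarity.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_person(emb1, emb2, threshold=0.5):
    # Accept the match when the embeddings are close enough on the hypersphere.
    return cosine_similarity(emb1, emb2) >= threshold
```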

Imagine that everyone has their own turf on the high-dimensional sphere.

In the end, the authors discuss the question: is the 512-d hypersphere large enough to hold large-scale identities?

The answer is "YES". High-dimensional space is so sparse that the distance between individual identities decreases only slowly even as the number of classes grows exponentially.

https://arxiv.org/pdf/1801.07698.pdf

I applied the model published in their model zoo, and its accuracy really amazed me. The inference speed is also practical: about 15 FPS on a 1080 Ti (MTCNN + one face embedding).

Here are some results:

I "registered" the Queen's face into the face space with the picture below (face detection with MTCNN):
