2013 IEEE International Conference on Computer Vision Workshops
Extensive Facial Landmark Localization
with Coarse-to-fine Convolutional Network Cascade
Erjin Zhou, Haoqiang Fan, Zhimin Cao, Yuning Jiang, Qi Yin
Megvii Inc. {zej,fhq,czm,jyn,yq}@megvii.com
Abstract
We present a new approach to localize extensive facial landmarks with a coarse-to-fine convolutional network cascade. Deep convolutional neural networks (DCNN) have been successfully utilized in facial landmark localization for two-fold advantages: 1) geometric constraints among facial points are implicitly utilized; 2) huge amounts of training data can be leveraged. However, in the task of extensive facial landmark localization, a large number of facial landmarks (more than 50 points) are required to be located in a unified system, which poses great difficulty in the structure design and training process of traditional convolutional networks. In this paper, we design a four-level convolutional network cascade, which tackles the problem in a coarse-to-fine manner. In our system, each network level is trained to locally refine a subset of the facial landmarks generated by previous network levels. In addition, each level predicts explicit geometric constraints (the position and rotation angle of a specific facial component) to rectify the inputs of the current network level. The combination of the coarse-to-fine cascade and the geometric refinement enables our system to locate extensive facial landmarks (68 points) accurately in the 300-W facial landmark localization challenge.
1. Introduction
Facial landmark localization plays a critical role in face recognition and face analysis systems. A recent paper by Chen et al. [4] shows that simple features can achieve leading performance on face recognition if accurate facial landmarks are available. For this reason, the problem of facial landmark localization has attracted extensive interest in the past years. In general, there are three main categories of methods to locate the facial landmarks in a face image. The first category performs a sliding-window search based on local-patch classifiers, which encounters the problems of ambiguity or corruption in local features.
Figure 1. Comparison of landmark localization systems. The first row is the original facial image. The second row is produced by local-patch detectors included in OpenCV [3]. The third row is produced by Stasm [9], an open source AAM implementation. Our result is shown in the fourth row, which outperforms the rest significantly.
Besides, it is difficult to incorporate global contextual information into the local search framework. The second category comprises the well-known frameworks of the Active Shape Model (ASM) [2] and the Active Appearance Model (AAM) [5]. These methods fit a generative model of the global facial appearance and hence are robust to local corruptions. However, estimating the parameters of these generative models requires expensive iterative steps.
Recently, a new framework based on explicit regression methods [10, 11] has been proposed. In this framework, the problem of landmark localization is treated directly as a regression task, and a holistic regressor is used to compute the landmark coordinates. Compared to the aforementioned methods, this framework is more robust and stable, since the global contextual information is incorporated from the very beginning.
Figure 2. System overview. The first-level network predicts the bounding boxes for the inner points and contour points separately. For the inner points, the second level predicts an initial estimate of the positions, which are refined by the third level for each component. The fourth level improves the predictions for the mouth and eyes by taking the rotated image patch as new input. Two levels of separate networks are used for the contour points. For clarity, not all of the 68 points are rendered in the figure.
It is also more efficient, since no iterative fitting step or sliding-window search is required. Instead of the random ferns used in [10], Sun et al. [11] apply the more powerful deep convolutional neural network (DCNN) in the regression framework and achieve state-of-the-art performance.
However, facial landmark localization remains a very challenging problem. The challenge comes from the large variations of facial appearance due to changes in pose, lighting, expression, etc. The task is even more challenging when a large number of landmark points is required. The nature of the challenge varies dramatically across different facial points, so a single-model method would probably fail. On the other hand, employing an individual system for each point sharply increases computational time. The large number of points is, however, a double-edged sword: valuable information pertaining to the inner structure of the relative positions of the landmarks becomes available. The geometric constraints on the global arrangement of facial components and the interaction of points inside a component provide hope for improvements in accuracy and robustness if the system amply exploits them.
To address the challenge, we carefully design a multi-level convolutional network cascade, which tackles the task of extensive facial landmark localization in a coarse-to-fine manner. Our contributions are three-fold: 1) unlike [11], which predicts sparse facial landmarks (5 points) with a network cascade, we validate the effectiveness of the convolutional network cascade for the problem of extensive facial landmark localization; 2) we design a coarse-to-fine network cascade to spread the network complexity and training burden of traditional convolutional networks; 3) we show that explicit geometric refinement (estimating the position/rotation of facial components and rectifying the inputs of each network level) can improve accuracy and robustness significantly. Extensive experiments show that our system is accurate and robust.
2. Overview
Figure 2 gives a brief illustration of our multi-level facial landmark localization system. We use the term inner points to denote the 51 points for the eyes, eyebrows, mouth and nose, and contour points for the 17 points on the face contour. The subsystems for the inner points and contour points are separated from the first level on. In the first level, two neural networks are trained to estimate the bounding boxes (the maximum and minimum values of the x-y coordinates) of the inner points and contour points independently. Each box is then fed into the corresponding part of the system.
Inner points. For the 51 inner points, three further levels of convolutional neural networks are trained. After the bounding box of the inner points is obtained, the 51 inner landmarks are initially estimated by the second level. Based on this initial estimate, the regions of the 6 facial components (i.e., eyebrows, eyes, mouth and nose) are computed separately. The third level is trained to refine the landmarks of each facial component independently. The rotation angle of each component is then estimated and corrected to upright, and the rotated patches are fed to the fourth-level networks for the final results.
Contour points. A simpler network cascade is utilized
for the localization of contour points. Given the bounding box covering the cheek, the second level takes the cropped image as input and computes the coordinates of the contour points from the raw pixels. Third- and fourth-level networks are not utilized due to limited time, and we leave the further exploitation of a deeper network cascade to future work. The control flow of the inner-point cascade is sketched in the code below.
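To make the data flow concrete, the following is a minimal, runnable sketch of the inner-point cascade under our own conventions: the `boxnet`/`pointnet` placeholders stand in for the trained DCNNs of each level, all names and signatures are illustrative rather than part of the actual system, and the level-4 rotation step is deferred to Section 4.

```python
import numpy as np

# Placeholder regressors standing in for the trained DCNNs; a real system
# would load learned networks here. Both return coordinates normalized
# to [0, 1] relative to their input patch (our convention).
def boxnet(patch):                     # level 1: box of a point group
    return np.array([[0.1, 0.1], [0.9, 0.9]])

def pointnet(n):                       # levels 2-3: landmark regressor
    return lambda patch: np.full((n, 2), 0.5)   # dummy: patch centers

COMPONENT_POINTS = {"left_brow": 5, "right_brow": 5, "left_eye": 6,
                    "right_eye": 6, "nose": 9, "mouth": 20}   # 51 in total

def crop(image, box):
    """Crop box = (x0, y0, x1, y1); also return origin and size so that
    patch-relative predictions can be mapped back to image coordinates."""
    x0, y0, x1, y1 = (int(round(v)) for v in box)
    return image[y0:y1, x0:x1], np.array([x0, y0]), np.array([x1 - x0, y1 - y0])

def to_image(pts, origin, size):
    return origin + pts * size

def localize_inner(image, face_box):
    """Levels 1-3 of the coarse-to-fine cascade for the 51 inner points."""
    # Level 1: bounding box of the inner points, from the face detection box.
    patch, origin, size = crop(image, face_box)
    inner_box = to_image(boxnet(patch), origin, size).ravel()
    # Level 2: initial estimate of all 51 inner landmarks inside that box.
    patch, origin, size = crop(image, inner_box)
    pts = to_image(pointnet(51)(patch), origin, size)
    # Level 3: refine each component inside its own (slightly enlarged) box.
    refined, start = {}, 0
    for comp, n in COMPONENT_POINTS.items():
        comp_pts = pts[start:start + n]
        box = np.concatenate([comp_pts.min(0), comp_pts.max(0)])
        margin = 0.15 * (box[2:] - box[:2]) + 1   # 10-20% enlargement (Sec. 4);
        box = np.concatenate([box[:2] - margin,   # +1 px guards the toy run
                              box[2:] + margin])
        patch, origin, size = crop(image, box)
        refined[comp] = to_image(pointnet(n)(patch), origin, size)
        start += n
    return refined

points = localize_inner(np.zeros((200, 200)), (20, 20, 180, 180))
```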
3. Coarse-to-fine DCNN cascade
The central idea of our framework is the design of the coarse-to-fine cascade. Each network level refines a subset of the landmarks inside a region computed by previous levels. In the first level, the face is divided into two parts: inner and contour. After the second level, the facial components of the inner part are further separated. To reduce computational cost, we do not train individual networks for each facial landmark. The coarse-to-fine framework has multiple advantages.
3.1. Separation of the loss function
The difficulty of localization is unbalanced across different landmarks. In particular, the contour is significantly more difficult than the inner points for two reasons. First, the facial image provides less local texture information for contour points compared to the inner landmarks, while the irrelevant information from the background near these points is noticeably greater. Second, the ground truth for these points is by nature noisier, because the definition of the exact position of each point is more ambiguous. These factors result in a heavy imbalance between the training errors of the two parts; hence, the L2 loss function would be dominated by the contour if all 68 points were trained together. Training two independent subsystems gives the whole system a chance to learn the detailed structure of the inner points instead of devoting most of its capacity to fitting the "difficult" contour. This argument is supported by our experiment; a back-of-the-envelope illustration is given at the end of this subsection.
Among the inner points, the relative difficulties of the facial components are still not uniform. As shown in Section 5, the eyebrows are notably harder, whilst the system's prediction for the eyes is more accurate.
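The imbalance can be illustrated with a back-of-the-envelope computation; the per-point error magnitudes below are invented purely for illustration, not measured values.

```python
import numpy as np

# Assumed per-point L2 training errors (arbitrary units): contour points
# are taken to be ~3x noisier than inner points, per the argument above.
inner_err = np.full(51, 1.0)      # 51 inner points
contour_err = np.full(17, 3.0)    # 17 contour points

joint_loss = np.sum(inner_err ** 2) + np.sum(contour_err ** 2)
contour_share = np.sum(contour_err ** 2) / joint_loss
print(f"contour: {17 / 68:.0%} of the points, "
      f"{contour_share:.0%} of the joint L2 loss")   # 25% vs. 75%
# A joint network's gradient would thus be dominated by the contour;
# separate subsystems let the inner network fit fine structure instead.
```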
3.2. Multi-level refinement
The localization task is decomposed into multiple stages, at each of which the interaction between the points or components is considered. In the first level, the relative position of the face contour, which is closely related to the pose of the face, is computed. In higher levels, more detailed information is revealed step by step. The second-level network learns the relative locations of the facial components, and the task of recognizing the shape inside a component is handled by the succeeding levels. It is possible that the third-level network is compromised by local corruption. However, since global information is taken into account in the second level, the final output still makes sense.
The bounding box carries the information about the position and range of the group of points to the next level. Thus, the image inside the box is generally well aligned in terms of translation and scaling. In contrast, the rectangle generated by the face detector is far from satisfactory. In some cases, it contains too much irrelevant background information, which confuses the neural network. Moreover, the face is not always centered in the rectangle, which further complicates the localization task for the system.
DCNN is generally considered powerful enough to handle great variation in the input image, but the capacity of a single network is still limited by its size. Given insufficient prior knowledge, the network will devote a considerable part of its power to finding where the face is. To tackle this problem, a "divide-and-conquer" strategy is adopted, which divides the task into two steps: first find the overall position, then compute the relative positions inside the region. For the whole face, the first step is performed by the first-level networks, whose supervision signal does not include the detailed structure of the points inside the bounding box; the rest of the task is left to the succeeding levels. In this way, the burden is shared across the networks of different levels, and good performance is achieved by networks of only moderate size.
The idea is extended further in the third and fourth levels, where the orientation is canonicalized by means of a rotation of the image patch. Rotation is considered only after the third level, since a failure to predict a robust rotation angle in the early levels would have serious consequences. Experimental results show that the fourth level gives a performance gain that is less dramatic than those of the previous levels, but absolutely non-negligible.
4. Implementation Details
Deep convolutional neural network. We use the DCNN as the basic building block of the system. The network takes the raw pixels as input and performs regression on the coordinates of the desired points. Figure 3 is an illustration of the deep architecture. Three convolutional layers are stacked after the input nodes. Each convolutional layer applies several filters to the multichannel input image and outputs the responses. Let the input to the t-th convolutional layer be I^t; the output is computed according to
C^t_{i,j,k} = \left| \tanh\left( \sum_{x=0}^{h_t-1} \sum_{y=0}^{w_t-1} \sum_{z=0}^{c_t-1} I^{t-1}_{i-x,\,j-y,\,z} \cdot F^t_{x,y,k,z} + B^t_k \right) \right|

where I^{t-1} represents the input to the convolutional layer, and F^t and B^t are tunable parameters. Following standard practice, the hyper-tangent and absolute-value functions are applied to the filter responses to bring non-linearity into the system.
Max-pooling with non-overlapping pooling regions is used after convolution:

I^t_{i,j,k} = \max_{0 \le x < s,\; 0 \le y < s} C^t_{i \cdot s + x,\; j \cdot s + y,\; k}

where s is the side length of the pooling region.

Figure 3. Typical structure of the networks in our system. The network consists of convolutional layers, unshared convolutional layers and fully-connected layers. Max-pooling is performed after the convolutional layers. In unshared convolutional layers, the weights used at different positions are different. Tanh and absolute-value non-linearities are inserted between the layers. The architectures of the other networks are similar to this one.

Table 1. Resolution, filter size and number of channels of the networks. N1 is used for the inner points in the second level; N2 is used for the contour points; N3 is used for the others. Two fully-connected layers are used in N3, with 120 hidden units between them. In N1 and N2, one fully-connected layer directly connects the output units and the unshared convolutional layer.

network | input | conv. 1 | conv. 2 | conv. 3 | unshared | hidden
N1      | 60×60 | 5×5×20  | 5×5×40  | 3×3×60  | 3×3×80   | --
N2      | 40×40 | 5×5×20  | 3×3×40  | 3×3×60  | 2×2×80   | --
N3      | 40×40 | 5×5×20  | 3×3×40  | 3×3×60  | 2×2×80   | 120
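For concreteness, here is a direct, unoptimized NumPy rendering of the two formulas above; the explicit loops mirror the equations rather than aiming for speed, and the layer sizes in the toy usage follow the N1 column of Table 1.

```python
import numpy as np

def conv_layer(I, F, B):
    """C[i,j,k] = |tanh(sum_{x,y,z} I[i-x, j-y, z] * F[x,y,k,z] + B[k])|.

    I: (H, W, c) input; F: (h, w, K, c) filters; B: (K,) biases.
    Only the valid region (where the window fits) is computed.
    """
    H, W, _ = I.shape
    h, w, K, _ = F.shape
    C = np.zeros((H - h + 1, W - w + 1, K))
    for i in range(h - 1, H):            # i - x must stay inside the image
        for j in range(w - 1, W):
            # Flip the window so win[x, y, z] == I[i - x, j - y, z].
            win = I[i - h + 1:i + 1, j - w + 1:j + 1, :][::-1, ::-1, :]
            for k in range(K):
                C[i - h + 1, j - w + 1, k] = abs(
                    np.tanh(np.sum(win * F[:, :, k, :]) + B[k]))
    return C

def max_pool(C, s):
    """Non-overlapping pooling: I'[i,j,k] = max_{0<=x,y<s} C[i*s+x, j*s+y, k]."""
    H, W, K = C.shape
    H, W = H - H % s, W - W % s          # drop any ragged border
    return C[:H, :W].reshape(H // s, s, W // s, s, K).max(axis=(1, 3))

# Toy usage: the first 5x5x20 layer of N1 on a 60x60 grayscale patch.
rng = np.random.default_rng(0)
I0 = rng.standard_normal((60, 60, 1))
C1 = conv_layer(I0, 0.1 * rng.standard_normal((5, 5, 20, 1)), np.zeros(20))
print(max_pool(C1, 2).shape)             # (28, 28, 20)
```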
Data augmentation. The training samples are randomly perturbed (by translation, rotation and scaling) before being fed into the network. This step creates a virtually infinite number of training samples and keeps the training error close to the error on our validation set. We also flip the images to reuse the left eye's model for the right eye, and the left eyebrow's model for the right eyebrow.
Image processing. Each image patch is normalized to zero mean and unit variance, and then a hyper-tangent function is applied so that the pixel values fall into the range [−1, 1]. When cropping the image inside a bounding box, the box is enlarged by 10% to 20%. The enlargement retains more context information and allows the system to tolerate small failures in the bounding-box estimation step. In the fourth level, the rotation angle is computed from the positions of two corner points of the facial component. These steps are sketched in code below.
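These preprocessing steps are simple enough to state in code; the 15% enlargement factor and the helper names are illustrative choices within the 10-20% range mentioned above.

```python
import numpy as np

def normalize_patch(patch):
    """Zero mean, unit variance, then tanh to squash pixels into [-1, 1]."""
    patch = (patch - patch.mean()) / (patch.std() + 1e-8)
    return np.tanh(patch)

def enlarge_box(box, factor=0.15):
    """Grow an (x0, y0, x1, y1) box by 10-20% per side so that context is
    kept and small bounding-box estimation errors are tolerated."""
    x0, y0, x1, y1 = box
    mx, my = factor * (x1 - x0), factor * (y1 - y0)
    return (x0 - mx, y0 - my, x1 + mx, y1 + my)

def rotation_angle(left_corner, right_corner):
    """Angle (radians) of the line through two corner points of a component
    (e.g. the mouth corners); rotating the patch by -angle makes it upright."""
    dx, dy = np.subtract(right_corner, left_corner)
    return np.arctan2(dy, dx)

print(enlarge_box((40, 40, 80, 80)))                   # (34.0, 34.0, 86.0, 86.0)
print(np.degrees(rotation_angle((30, 50), (70, 42))))  # about -11.3 degrees
```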
5. Experiment
We conducted our experiments on a dataset containing 3837 images provided by the 300 Faces in the Wild (300-W) Challenge. The images and annotations come from AFW, LFPW, HELEN, and IBUG [6, 1, 12, 7, 8]. A subset of 500 images is randomly selected as our validation set. Two performance metrics are used on the validation set. The first one is the average distance between the predicted landmark positions and the ground truth, normalized by the inter-ocular distance:
err = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{M} \sum_{j=1}^{M} \frac{\lVert p_{i,j} - g_{i,j} \rVert_2}{d_i}

where p_{i,j} and g_{i,j} denote the predicted and ground-truth positions of the j-th landmark in the i-th image, N is the number of images, M is the number of landmarks, and d_i is the inter-ocular distance of the i-th face.
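A direct implementation of this metric, assuming `pred` and `gt` are arrays of shape (N, M, 2) and `iod` holds the per-image inter-ocular distances:

```python
import numpy as np

def mean_normalized_error(pred, gt, iod):
    """Mean point-to-point distance, normalized by inter-ocular distance.

    pred, gt: (N, M, 2) predicted / ground-truth landmarks for N images
    with M points each; iod: (N,) inter-ocular distances.
    """
    dists = np.linalg.norm(pred - gt, axis=2)    # (N, M) Euclidean distances
    return float(np.mean(dists / iod[:, None]))  # average over points, images
```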