Paper: "How far are we from solving the 2D & 3D Face Alignment problem?" — Notes and Translation
Contents
How far are we from solving the 2D & 3D Face Alignment problem?
Abstract
1. Introduction
2. Closely related work
3. Datasets
3.1. Training datasets
3.2. Test datasets
3.3. Metrics
4. Method
4.1. 2D and 3D Face Alignment Networks
4.2. 2D-to-3D Face Alignment Network
4.3. Training
5. 2D face alignment
6. Large Scale 3D Faces in-the-Wild dataset
7. 3D face alignment
8. Ablation studies
9. Conclusions
10. Acknowledgments
How far are we from solving the 2D & 3D Face Alignment problem?
Adrian Bulat and Georgios Tzimiropoulos Computer Vision Laboratory, The University of Nottingham Nottingham, United Kingdom
Project page: https://www.adrianbulat.com/face-alignment
Paper: https://arxiv.org/pdf/1703.07332.pdf
Abstract
This paper investigates how far a very deep neural network is from attaining close to saturating performance on existing 2D and 3D face alignment datasets. To this end, we make the following 5 contributions:

Training and testing code as well as the dataset can be downloaded from https://www.adrianbulat.com/face-alignment/
1. Introduction
With the advent of Deep Learning and the development of large annotated datasets, recent work has shown results of unprecedented accuracy even on the most challenging computer vision tasks. In this work, we focus on landmark localization, in particular on facial landmark localization, also known as face alignment, arguably one of the most heavily researched topics in computer vision over the last decades. Very recent work on landmark localization using Convolutional Neural Networks (CNNs) has pushed the boundaries in other domains like human pose estimation [39, 38, 24, 17, 27, 42, 23, 5], yet it remains unclear what has been achieved so far for the case of face alignment. The aim of this work is to address this gap in the literature.

Historically, different techniques have been used for landmark localization depending on the task in hand. For example, work in human pose estimation, prior to the advent of neural networks, was primarily based on pictorial structures [12] and sophisticated extensions [44, 25, 36, 32, 26] due to their ability to model large appearance changes and accommodate a wide spectrum of human poses. Such methods though have not been shown capable of achieving the high degree of accuracy exhibited by cascaded regression methods for the task of face alignment [11, 8, 43, 50, 41]. On the other hand, the performance of cascaded regression methods is known to deteriorate in cases of inaccurate initialisation and of large (and unfamiliar) facial poses, when there is a significant number of self-occluded landmarks or large in-plane rotations.
Figure 1: The Face Alignment Network (FAN) constructed by stacking four HGs, in which all bottleneck blocks (depicted as rectangles) were replaced with the hierarchical, parallel and multi-scale block of [7].
More recently, fully Convolutional Neural Network architectures based on heatmap regression have revolutionized human pose estimation [39, 38, 24, 17, 27, 42, 23, 5], producing results of remarkable accuracy even for the most challenging datasets [1]. Thanks to their end-to-end training and little need for hand engineering, such methods can be readily applied to the problem of face alignment. Following this path, our main contribution is to construct and train such a powerful network for face alignment and investigate for the first time how far it is from attaining close to saturating performance on all existing 2D face alignment datasets and a newly introduced large scale 3D dataset. More specifically, our contributions are:
2. Closely related work
This Section reviews related work on face alignment and landmark localization. Datasets are described in detail in the next Section.

2D face alignment. Prior to the advent of Deep Learning, methods based on cascaded regression had emerged as the state-of-the-art in 2D face alignment, see for example [8, 43, 50, 41]. Such methods are now considered to have largely "solved" the 2D face alignment problem for faces with controlled pose variation like the ones of LFPW [2], Helen [22] and 300-W [30]. We will keep the main result from these works, namely their performance on the frontal dataset of LFPW [2]. This performance will be used as a measure of comparison of how well the methods described in this paper perform, assuming that a method achieving a similar error curve on a different dataset is close to saturating that dataset.
CNNs for face alignment. By no means are we the first to use CNNs for face alignment. The method of [35] uses a CNN cascade to regress the facial landmark locations. The work in [47] proposes multi-task learning for joint facial landmark localization and attribute classification. More recently, the method of [40] extends [43] within recurrent neural networks. All these methods have been mainly shown effective for the near-frontal faces of 300-W [30]. Recent work on large pose and 3D face alignment includes [20, 50], which perform face alignment by fitting a 3D Morphable Model (3DMM) to a 2D facial image. The work in [20] proposes to fit a dense 3DMM using a cascade of CNNs. The approach of [50] fits a 3DMM in an iterative manner through a single CNN which is augmented by additional input channels (besides RGB) representing shape features at each iteration. More recent works that are closer to the methods presented in this paper are [4] and [6]. Nevertheless, [4] is evaluated on [20], which is a relatively small dataset (3,900 images for training and 1,200 for testing), and [6] on [19], which is of moderate size (16,200 images for training and 4,900 for testing), includes mainly images collected in the lab, and does not cover the full spectrum of facial poses. Hence, the results of [4] and [6] are not conclusive in regards to the main questions posed in our paper.

Landmark localization. A detailed review of state-of-the-art methods on landmark localization for human pose estimation is beyond the scope of this work; please see [39, 38, 24, 17, 27, 42, 23, 5]. For the needs of this work, we built a powerful CNN for 2D and 3D face alignment based on two components: (a) the state-of-the-art HourGlass (HG) network of [23], and (b) the hierarchical, parallel & multi-scale block recently proposed in [7]. In particular, we replaced the bottleneck block [15] used in [23] with the block proposed in [7].
Transferring landmark annotations. There are a few works that have attempted to unify facial alignment datasets by transferring landmark annotations, typically through exploiting common landmarks across datasets [49, 34, 46]. Such methods have primarily been shown to be successful when landmarks are transferred from more challenging to less challenging images; for example, in [49] the target dataset is LFW [16], and [34] provides annotations only for the relatively easy images of AFLW [21]. Hence, the community primarily relies on the unification performed manually by the 300-W challenge [29], which contains less than 5,000 near frontal images annotated from a 2D perspective. Using 300-W-LP [50] as a basis, this paper presents the first attempt to provide 3D annotations for all other datasets, namely AFLW-2000 [50] (2,000 images), the 300-W test set [28] (600 images), 300-VW [33] (218,595 frames), and the Menpo training set (9,000 images). To this end, we propose a guided-by-2D-landmarks CNN which converts 2D annotations to 3D and unifies all the aforementioned datasets.
3. Datasets
In this Section, we provide a description of how existing 2D and 3D datasets were used for training and testing for the purposes of our experiments. We note that the 3D annotations preserve correspondence across pose, as opposed to the 2D ones, and, in general, they should be preferred. We emphasize that the 3D annotations are actually the 2D projections of the 3D facial landmark coordinates, but for simplicity we will just call them 3D. In the supplementary material, we present a method for extending these annotations to full 3D. Finally, we emphasize that we performed cross-database experiments only.
Table 1: Summary of the most popular face alignment datasets and their main characteristics.
3.1. Training datasets
For training and validation, we used 300-W-LP [50], a synthetically expanded version of 300-W [29]. 300-W-LP provides both 2D and 3D landmarks, allowing for training models and conducting experiments using both types of annotations. For some 2D experiments, we also used the original 300-W dataset [29], for fine-tuning only. This is because the 2D landmarks of 300-W-LP are not entirely compatible with the 2D landmarks of the test sets used in our experiments (i.e. the 300-W test set [28], 300-VW [33] and Menpo [45]), but the original annotations from 300-W are.

300-W. 300-W [29] is currently the most widely-used in-the-wild dataset for 2D face alignment. The dataset itself is a concatenation of a series of smaller datasets: LFPW [3], HELEN [22], AFW [51] and iBUG [30], where each image was re-annotated in a consistent manner using the 68 2D landmark configuration of Multi-PIE [13]. The dataset contains in total ~4,000 near frontal facial images.

300W-LP-2D and 300W-LP-3D. 300-W-LP is a synthetically generated dataset obtained by rendering the faces of 300-W into larger poses, ranging from −90° to 90°, using the profiling method of [50]. The dataset contains 61,225 images, providing both 2D (300W-LP-2D) and 3D landmark annotations (300W-LP-3D).
3.2. Test datasets
This Section describes the test sets used for our 2D and 3D experiments. Observe that there is a large number of 2D datasets/annotations which are however problematic for moderately large poses (2D landmarks lose correspondence), and that the only in-the-wild 3D test set is AFLW2000-3D [50]. We address this significant gap in 3D face alignment datasets in Section 6.
3.2.1 2D datasets

300-W test set. The 300-W test set consists of the 600 images used for the evaluation purposes of the 300-W Challenge [28]. The images are split in two categories: Indoor and Outdoor. All images were annotated with the same 68 2D landmarks as the ones used in the 300-W dataset.

300-VW. 300-VW [33] is a large-scale face tracking dataset, containing 114 videos and in total 218,595 frames. From the total of 114 videos, 64 are used for testing and 50 for training. The test videos are further separated into three categories (A, B, and C), with the last one being the most challenging. It is worth noting that some videos (especially from category C) contain very low resolution/poor quality faces. Due to the semi-automatic annotation approach (see [33] for more details), in some cases the annotations for these videos are not so accurate (see Fig. 3). Another source of annotation error is caused by facial pose, i.e. large poses are also not accurately annotated (see Fig. 3).

Menpo. Menpo is a recently introduced dataset [45] containing landmark annotations for about 9,000 faces from FDDB [18] and AFLW. Frontal faces were annotated in terms of 68 landmarks using the same annotation policy as the one of 300-W, but profile faces in terms of 39 different landmarks which are not in correspondence with the landmarks from the 68-point mark-up.
3.2.2 3D datasets

AFLW2000-3D. AFLW2000-3D [50] is a dataset constructed by re-annotating the first 2,000 images from AFLW [21] using 68 3D landmarks, in a manner consistent with the ones from 300W-LP-3D. The faces of this dataset contain large-pose variations (yaw from −90° to 90°), with various expressions and illumination conditions. However, some annotations, especially for larger poses or occluded faces, are not so accurate (see Fig. 6).
3.3. Metrics
Traditionally, the metric used for face alignment is the point-to-point Euclidean distance normalized by the inter-ocular distance [10, 29, 33]. However, as noted in [51], this error metric is biased for profile faces, for which the inter-ocular distance can be very small. Hence, we normalize by the bounding box size. In particular, we used the Normalized Mean Error, defined as:

NME = (1/N) Σ_{k=1}^{N} ‖x_k − y_k‖₂ / d,

where x denotes the ground truth landmarks for a given face, y the corresponding prediction, and d is the square root of the ground truth bounding box area, computed as d = √(w_bbox · h_bbox). Although we conducted both 2D and 3D experiments, we opted to use the same bounding box definition for both experiments; in particular, we used the bounding box calculated from the 2D landmarks. This way, we can readily compare the accuracy achieved in 2D and 3D.
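As a concrete sketch, the metric above can be computed as follows (pure Python; the function name and the toy landmark values are illustrative, not from the paper):

```python
import math

def nme(pred, gt, bbox_w, bbox_h):
    """Normalized Mean Error: mean point-to-point Euclidean distance,
    normalized by d = sqrt(w_bbox * h_bbox) of the ground-truth box."""
    d = math.sqrt(bbox_w * bbox_h)
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / (len(dists) * d)

# toy example: two landmarks, one off by a 3-4-5 triangle, box 25x4 (d = 10)
print(nme([(3, 4), (0, 0)], [(0, 0), (0, 0)], 25, 4))  # 0.025 * 10 = 0.25
```

In practice the per-face NME values would be aggregated into a cumulative error curve, as done for the figures in this paper.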
?
4. Method
This Section describes FAN, the network used for 2D and 3D face alignment. It also describes 2D-to-3D FAN, the network used for constructing the very large scale 3D face alignment dataset (LS3D-W), containing more than 230,000 3D landmark annotations.
4.1. 2D and 3D Face Alignment Networks

We coin the network used for our experiments simply Face Alignment Network (FAN). To our knowledge, it is the first time that such a powerful network is trained and evaluated for large scale 2D/3D face alignment experiments. We construct FAN based on one of the state-of-the-art architectures for human pose estimation, namely the HourGlass (HG) network of [23]. In particular, we used a stack of four HG networks (see Fig. 1). While [23] uses the bottleneck block of [14] as the main building block for the HG, we go one step further and replace the bottleneck block with the recently introduced hierarchical, parallel and multi-scale block of [7]. As was shown in [7], this block outperforms the original bottleneck of [14] when the same number of network parameters is used. Finally, we used 300W-LP-2D and 300W-LP-3D to train 2D-FAN and 3D-FAN, respectively.

Figure 2: The 2D-to-3D-FAN network used for the creation of the LS3D-W dataset. The network takes as input the RGB image and the 2D landmarks and outputs the corresponding 2D projections of the 3D landmarks.
4.2. 2D-to-3D Face Alignment Network

Our aim is to create the very first very large scale dataset of 3D facial landmarks, for which annotations are scarce. To this end, we followed a guided-based approach in which a FAN for predicting 3D landmarks is guided by 2D landmarks. In particular, we created a 3D-FAN in which the input RGB channels have been augmented with 68 additional channels, one for each 2D landmark, containing a 2D Gaussian with std = 1px centered at each landmark's location. We call this network 2D-to-3D FAN. Given the 2D facial landmarks for an image, 2D-to-3D FAN converts them to 3D. To train 2D-to-3D FAN, we used 300-W-LP, which provides both 2D and 3D annotations for the same image. We emphasize again that the 3D annotations are actually the 2D projections of the 3D coordinates, but for simplicity we call them 3D. Please see the supplementary material for extending these annotations to full 3D.
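The 68 Gaussian guidance channels described above can be sketched as follows (pure Python; an unnormalized Gaussian peaking at 1.0 is one plausible reading of the paper's description, and the tiny map size is purely illustrative):

```python
import math

def gaussian_heatmap(width, height, cx, cy, std=1.0):
    # one channel per landmark: a 2D Gaussian with the given std (in pixels),
    # peaking at 1.0 at the landmark location (cx, cy)
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * std ** 2))
             for x in range(width)]
            for y in range(height)]

def landmark_channels(landmarks_2d, width, height):
    # the 68 extra input channels stacked alongside the RGB channels
    return [gaussian_heatmap(width, height, x, y, std=1.0)
            for (x, y) in landmarks_2d]

channels = landmark_channels([(3, 4), (5, 2)], 8, 8)
print(channels[0][4][3])  # 1.0 at the first landmark's location
```

At training time these channels would be built from the ground-truth 2D landmarks; at dataset-creation time, from the predictions of 2D-FAN.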
4.3. Training
For all of our experiments, we independently trained three distinct networks: 2D-FAN, 3D-FAN, and 2D-to-3D-FAN. For the first two networks, we set the initial learning rate to 10^-4 and used a minibatch of 10. During the process, we dropped the learning rate to 10^-5 after 15 epochs and to 10^-6 after another 15, training for a total of 40 epochs. We also applied random augmentation: flipping, rotation (from −50° to 50°), color jittering, scale noise (from 0.8 to 1.2) and random occlusion. The 2D-to-3D-FAN model was trained by following a similar procedure, increasing the amount of augmentation even further: rotation (from −70° to 70°) and scale (from 0.7 to 1.3). Additionally, the learning rate was initially set to 10^-3. All networks were implemented in Torch7 [9] and trained using rmsprop [37].
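The step schedule above can be written down explicitly (a minimal sketch; treating "after 15 epochs" as epochs 0-14 at the base rate is our assumption about the exact boundaries):

```python
def learning_rate(epoch):
    """Step schedule for 2D-FAN / 3D-FAN: 1e-4 for epochs 0-14,
    1e-5 for epochs 15-29, 1e-6 for the remaining epochs (40 total)."""
    if epoch < 15:
        return 1e-4
    if epoch < 30:
        return 1e-5
    return 1e-6

print([learning_rate(e) for e in (0, 15, 30)])  # [0.0001, 1e-05, 1e-06]
```

The 2D-to-3D-FAN run would use the same shape of schedule starting from 1e-3 instead.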
5. 2D face alignment
This Section evaluates 2D-FAN (trained on 300-W-LP-2D) on the 300-W test set, 300-VW (both training and test sets), and Menpo (frontal subset). Overall, 2D-FAN is evaluated on more than 220,000 images. Prior to reporting our results, the following points need to be emphasized:

Figure 3: Fittings with the highest error from 300-VW (NME 6.8-7%). Red: ground truth. White: our predictions. In most cases, our predictions are more accurate than the ground truth.
The cumulative error curves for our 2D experiments on 300-VW, the 300-W test set and Menpo are shown in Fig. 8. We additionally report the performance of MDM on all datasets, initialized by ground truth bounding boxes; of ICCR, the state-of-the-art face tracker of [31], on 300-VW (the only tracking dataset); and of our unconventional baseline (called MDM-on-LFPW). Comparisons with a number of methods in terms of AUC are also provided in Table 2. With the exception of Category C of 300-VW, it is evident that 2D-FAN achieves literally the same performance on all datasets, outperforming MDM and ICCR, and, notably, matching the performance of MDM-on-LFPW. Out of 7,200 images (from Menpo and the 300-W test set), there are in total only 18 failure cases, which represent 0.25% of the images (we consider a failure a fitting with NME > 7%). After removing these cases, the 8 fittings with the highest error for each dataset are shown in Fig. 4.

Figure 4: Fittings with the highest error from the 300-W test set (first row) and Menpo (second row) (NME 6.5-7%). Red: ground truth. White: our predictions. In most cases, our predictions are more accurate than the ground truth.

Regarding Category C of 300-VW, we found that the main reason for this performance drop is the quality of the annotations, which were obtained in a semi-automatic manner. After removing all failure cases (101 frames, representing 0.38% of the total number of frames), Fig. 3 shows the quality of our predictions vs the ground truth landmarks for the 8 fittings with the highest error for this dataset. It is evident that in most cases our predictions are more accurate.

Conclusion: Given that 2D-FAN matches the performance of MDM-on-LFPW, we conclude that 2D-FAN achieves near-saturating performance on the above 2D datasets. Notably, this result was obtained by training 2D-FAN primarily on synthetic data, and there was a mismatch between training and testing landmark annotations.
6. Large Scale 3D Faces in-the-Wild dataset
Motivated by the scarcity of 3D face alignment annotations and the remarkable performance of 2D-FAN, we opted to create a large scale 3D face alignment dataset by converting all existing 2D face alignment annotations to 3D. To this end, we trained a 2D-to-3D FAN as described in Subsection 4.2 and guided it using the predictions of 2D-FAN, creating 3D landmarks for: the 300-W test set, 300-VW (both the training and all 3 testing datasets), and Menpo (the whole dataset).

Evaluating 2D-to-3D is difficult: the only available 3D face alignment in-the-wild dataset (not used for training) is AFLW2000-3D [50]. Hence, we applied our pipeline (consisting of applying 2D-FAN for producing the 2D landmarks and then 2D-to-3D FAN for converting them to 3D) on AFLW2000-3D and then calculated the error, shown in Fig. 5 (note that for normalization purposes, 2D bounding box annotations are still used). The results show that there is a discrepancy between our 3D landmarks and the ones provided by [50]. After removing a few failure cases (19 in total, which represent 0.9% of the data), Fig. 6 shows the 8 images with the highest error between our 3D landmarks and the ones of [50]. It is evident that this discrepancy is mainly caused by the semi-automatic annotation pipeline of [50], which does not produce accurate landmarks, especially for images with difficult poses.
Table 2: AUC (calculated for a threshold of 7%) on all major 2D face alignment datasets. MDM, CFSS and TCDCN were evaluated using ground truth bounding boxes and the openly available code.

Figure 5: NME on AFLW2000-3D, between the original annotations of [50] and the ones generated by 2D-to-3D-FAN. The error is mainly introduced by the automatic annotation process of [50]. See Fig. 6 for visual examples.

Figure 6: Fittings with the highest error from AFLW2000-3D (NME 7-8%). Red: ground truth from [50]. White: predictions of 2D-to-3D-FAN. In most cases, our predictions are more accurate than the ground truth.
By additionally including AFLW2000-3D into the aforementioned datasets, overall ~230,000 images were annotated in terms of 3D landmarks, leading to the creation of the Large Scale 3D Faces in-the-Wild dataset (LS3D-W), the largest 3D face alignment dataset to date.
7. 3D face alignment
This Section evaluates 3D-FAN, trained on 300-W-LP-3D, on LS3D-W (described in the previous Section), i.e. on the 3D landmarks of the 300-W test set, 300-VW (both training and test sets), Menpo (the whole dataset) and AFLW2000-3D (re-annotated). Overall, 3D-FAN is evaluated on ~230,000 images. Note that compared to the 2D experiments reported in Section 5, more images in large poses have been used, as our 3D experiments also include AFLW2000-3D and the profile images of Menpo (~2,000 more images in total). The results of our 3D face alignment experiments on the 300-W test set, 300-VW, Menpo and AFLW2000-3D are shown in Fig. 9. We additionally report the performance of the state-of-the-art method of 3DDFA (trained on the same dataset as 3D-FAN) on all datasets.

Conclusion: 3D-FAN essentially produces the same accuracy on all datasets, largely outperforming 3DDFA. This accuracy is slightly increased compared to the one achieved by 2D-FAN, especially for the part of the error curve for which the error is less than 2%, something which is not surprising as now the training and testing datasets are annotated using the same mark-up.
8. Ablation studies
To further investigate the performance of 3D-FAN under challenging conditions, we first created a dataset of 7,200 images from LS3D-W so that there is an equal number of images in the yaw-angle ranges [0°–30°], [30°–60°] and [60°–90°]. We call this dataset LS3D-W Balanced. Then, we conducted the following experiments:
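The pose-balanced subset described above can be sketched as a simple binned sampling step. This is an illustrative reconstruction, not the authors' code; the `per_bin` size of 2,400 (7,200 / 3 bins) follows the paper's numbers, while the sample representation is an assumption.

```python
import random
from collections import defaultdict

def balance_by_yaw(samples, bins=((0, 30), (30, 60), (60, 90)),
                   per_bin=2400, seed=0):
    """Draw an equal number of samples from each absolute-yaw bin.

    `samples` is a list of (image_path, yaw_degrees) pairs; the bin
    edges and per-bin size mirror the LS3D-W Balanced description.
    """
    by_bin = defaultdict(list)
    for path, yaw in samples:
        for lo, hi in bins:
            # Assign by absolute yaw; fold 90° into the last bin.
            if lo <= abs(yaw) < hi or (hi == 90 and abs(yaw) == 90):
                by_bin[(lo, hi)].append((path, yaw))
                break
    rng = random.Random(seed)
    balanced = []
    for b in bins:
        pool = by_bin[b]
        balanced.extend(rng.sample(pool, min(per_bin, len(pool))))
    return balanced
```

With enough images in every bin, the result contains exactly `per_bin` images per yaw range, so per-pose AUC comparisons are not skewed by the near-frontal bias of in-the-wild data.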
Table 3: AUC (calculated for a threshold of 7%) on LS3D-W Balanced for different yaw angles.
Table 4: AUC on LS3D-W Balanced for different levels of initialization noise. The network was trained with a noise level of up to 20%.
Table 5: AUC on LS3D-W Balanced for various network sizes. Between 12M and 24M parameters, performance remains almost the same.
Figure 7: AUC on LS3D-W Balanced for different face resolutions. Performance remains high down to face sizes of about 30px.
Performance across pose. We report the performance of 3D-FAN on LS3D-W Balanced for each pose separately in terms of the Area Under the Curve (AUC), calculated for a threshold of 7%, in Table 3. We observe only a slight degradation of performance for very large poses ([60°–90°]). We believe this is to some extent to be expected, as 3D-FAN was largely trained with synthetic data for these poses (300-W-LP-3D). This data was produced by warping frontal images (i.e. the ones of 300-W) to very large poses, which causes face distortion, especially for the face region close to the ears. Conclusion: facial pose is not a major issue for 3D-FAN.

Performance across resolution. We repeated the previous experiment but for different face resolutions (resolution is reduced relative to the face size defined by the tight bounding box) and report the performance of 3D-FAN in terms of AUC in Fig. 7. Note that we did not retrain 3D-FAN to work at such low resolutions. We observe a significant performance drop, for all poses, only when the face size is as low as 30 pixels. Conclusion: resolution is not a major issue for 3D-FAN.

Performance across noisy initializations. For all results reported so far, we added 10% noise to the ground-truth bounding boxes. Note that 3D-FAN was trained with a noise level of 20%. Here, we repeated the previous experiment for different noise levels and report the performance of 3D-FAN in terms of AUC in Table 4. We observe only a small performance decrease even for a noise level of 30%, which is greater than the noise level the network was trained with. Conclusion: initialization is not a major issue for 3D-FAN.
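The AUC metric used throughout these tables is the area under the cumulative error distribution (CED) curve up to the 7% error threshold, normalized so a perfect method scores 1.0. A minimal sketch, assuming per-image normalized landmark errors as input (the exact error normalization follows the paper's evaluation protocol, not this snippet):

```python
import numpy as np

def auc_at_threshold(errors, threshold=0.07, steps=1000):
    """AUC of the CED curve up to `threshold`, scaled to [0, 1].

    `errors` are per-image normalized landmark errors; 0.07 matches
    the 7% cut-off used for Tables 3-5.
    """
    errors = np.asarray(errors, dtype=float)
    xs = np.linspace(0.0, threshold, steps)
    # CED value at each x: fraction of images with error <= x.
    ced = (errors[None, :] <= xs[:, None]).mean(axis=1)
    # Trapezoidal integration, normalized by the threshold.
    dx = xs[1] - xs[0]
    area = ((ced[:-1] + ced[1:]) * dx / 2.0).sum()
    return area / threshold
```

A method whose errors all exceed the threshold scores 0, so the metric rewards both accuracy and consistency across images.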
Performance across different network sizes. For all results reported so far, we used a very powerful 3D-FAN with 24M parameters. Here, we repeated the previous experiment varying the number of network parameters and report the performance of 3D-FAN in terms of AUC in Table 5. The number of parameters is varied by first reducing the number of HG networks used from 4 to 1, and then by reducing the number of channels inside the building block. It is important to note that even the biggest network runs at 28-30 fps on a TitanX GPU, while the smallest one can reach 150 fps. We observe only a small performance drop down to 12M parameters; the network's performance starts to drop significantly only when the number of parameters becomes as low as 6M. Conclusion: there is a moderate performance drop vs. the number of parameters of 3D-FAN. We believe this is an interesting direction for future work.
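The "noise level" in the initialization experiments refers to random perturbation of the ground-truth face bounding box, simulating an imperfect face detector. A sketch of one plausible jitter scheme is below; the exact perturbation used by the authors is not specified here, so the shift-plus-rescale formulation is an assumption.

```python
import random

def jitter_bbox(x_min, y_min, x_max, y_max, noise=0.2, rng=random):
    """Perturb a ground-truth face box by up to `noise` of its size.

    Shifts the box center and rescales it, as a stand-in for face
    detector error. The 20% default matches the training noise level
    mentioned above; 0.1 or 0.3 reproduce the test-time settings.
    """
    w, h = x_max - x_min, y_max - y_min
    dx = rng.uniform(-noise, noise) * w   # horizontal center shift
    dy = rng.uniform(-noise, noise) * h   # vertical center shift
    s = 1.0 + rng.uniform(-noise, noise)  # scale perturbation
    cx, cy = x_min + w / 2 + dx, y_min + h / 2 + dy
    nw, nh = w * s, h * s
    return (cx - nw / 2, cy - nh / 2, cx + nw / 2, cy + nh / 2)
```

Training with such jitter makes the landmark network tolerant of initialization error, which is why test-time noise of 30% (beyond the 20% seen in training) causes only a small AUC drop.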
9. Conclusions
We constructed a state-of-the-art neural network for landmark localization, trained it for 2D and 3D face alignment, and evaluated it on hundreds of thousands of images. Our results show that our network nearly saturates these datasets, showing remarkable resilience to pose, resolution, initialization, and even to the number of network parameters used. Although some very unfamiliar poses were not explored in these datasets, there is no reason to believe that, given sufficient data, the network would not have the learning capacity to accommodate them too.
10. Acknowledgments
Adrian Bulat was funded by a PhD scholarship from the University of Nottingham. This work was supported in part by the EPSRC project EP/M02153X/1 Facial Deformable Models of Animals.