Deep Video Portraits
Synthesizing and editing video portraits—i.e., videos framed to show a person’s head and upper body—is an important problem in computer graphics, with applications in video editing and movie postproduction, visual effects, visual dubbing, virtual reality, and telepresence, among others.
The problem of synthesizing a photo-realistic video portrait of a target actor that mimics the actions of a source actor—and especially where the source and target actors can be different subjects—is still an open problem.
Until now, no approach has enabled full control over the rigid head pose, facial expressions, and eye motion of the target actor, let alone modification of the face identity to some extent.
In this post, I’m going to review “Deep Video Portraits”, which presents a novel approach that enables photo-realistic re-animation of portrait videos using only an input video.
In this post, I’ll cover two things: First, a short definition of a DeepFake. Second, an overview of the paper “Deep Video Portraits” in the words of the authors.
1. Defining DeepFakes
The word DeepFake combines the terms “deep learning” and “fake”, and refers to manipulated videos or other digital representations that produce fabricated images and sounds that appear to be real but have in fact been generated by deep neural networks.
2. Deep Video Portraits

2.1 Overview
The core method presented in the paper provides full control over the head of a target actor by transferring the rigid head pose, facial expressions, and eye motion of a source actor, while preserving the target’s identity and appearance.
On top of that, full video of the target is synthesized, including consistent upper body posture, hair, and background.
Figure 1. Facial reenactment results from “DVP”. Expressions are transferred from the source to the target actor, while retaining the head pose (rotation and translation) as well as the eye gaze of the target actor.

The overall architecture of the paper’s framework is illustrated below in Figure 2.
First, the source and target actors are tracked using a state-of-the-art face reconstruction approach that works from a single image, and a 3D morphable model (3DMM) is fitted to best match the source and target actors.
The resulting sequence of low-dimensional parameter vectors represents the actor’s identity, head pose, expression, eye gaze, and the scene lighting for every video frame.
Then, the head pose, expressions and/or eye gaze parameters from the source are taken and mixed with the illumination and identity parameters of the target. This allows the network to generate a full-head reenactment while preserving the actor’s identity and look.
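This parameter-mixing step can be sketched in a few lines. The dictionary field names and the coefficient dimensions below are illustrative assumptions; the paper only specifies which parameter groups are taken from the source and which are kept from the target:

```python
# Sketch of the parameter-mixing step: head pose, expression, and eye gaze
# come from the source actor; identity and illumination come from the target.
# Field names and dimensions are illustrative assumptions, not paper values.

def mix_parameters(source_params, target_params):
    """Combine per-frame parameter dicts for full-head reenactment."""
    return {
        "pose":         source_params["pose"],          # rigid head pose (from source)
        "expression":   source_params["expression"],    # expression coefficients (from source)
        "gaze":         source_params["gaze"],          # eye gaze for both eyes (from source)
        "identity":     target_params["identity"],      # identity coefficients (kept from target)
        "illumination": target_params["illumination"],  # SH lighting (kept from target)
    }

src = {"pose": [0.1] * 6, "expression": [0.0] * 64, "gaze": [0.0] * 4,
       "identity": [1.0] * 80, "illumination": [0.5] * 27}
tgt = {"pose": [0.0] * 6, "expression": [0.2] * 64, "gaze": [0.1] * 4,
       "identity": [2.0] * 80, "illumination": [0.9] * 27}

mixed = mix_parameters(src, tgt)
assert mixed["pose"] == src["pose"]          # motion follows the source
assert mixed["identity"] == tgt["identity"]  # identity stays with the target
```

Because the mix is a plain per-group selection, partial transfers (e.g., expression-only reenactment, as in Section 3.2) amount to swapping fewer fields.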
Next, new synthetic renderings of the target actor are generated based on the mixed parameters. These renderings are the input to the paper’s novel “rendering-to-video translation network”, which is trained to convert the synthetic input into photo-realistic output.
Figure 2. Deep video portraits enable a source actor to fully control a target video portrait. First, a low-dimensional parametric representation (left) of both videos is obtained using monocular face reconstruction. The head pose, expression, and eye gaze can then be transferred in parameter space (middle). Finally, rendered conditioning input images are converted to a photo-realistic video portrait of the target actor (right). Obama video courtesy of the White House (public domain).

2.2 Face Reconstruction from a Single Image
3D morphable models are used for face analysis because the intrinsic properties of 3D faces provide a representation that’s immune to intra-personal variations, such as pose and illumination. Given a single facial input image, a 3DMM can recover 3D face (shape and texture) and scene properties (pose and illumination) via a fitting process.
The authors employ a state-of-the-art dense face reconstruction approach that fits a parametric model of the face and illumination to each video frame. It obtains a meaningful parametric face representation for both the source and the target, given an input video sequence.
Equation 1. The source actor video sequence, where N_s denotes the total number of source frames.

The meaningful parametric face representation consists of a set of parameters P, which can be written as the corresponding parameter sequence that fully describes the source or target facial performance.
Equation 2. A meaningful parametric face representation best describes each frame in the input video sequence.

The set of reconstructed parameters P encodes the rigid head pose, facial identity coefficients, expression coefficients, gaze direction for both eyes, and spherical harmonics illumination coefficients. Overall, the face reconstruction process estimates 261 parameters per video frame.
Below are more details on the parametric face representation and the fitting process.
2.2.1 Parametric Face Representation
The paper represents the space of facial identity based on a parametric head model, and the space of facial expressions via an affine model. Mathematically, they model geometry variation through an affine model v∈ R^(3N) that stacks per-vertex deformations of the underlying template mesh with N vertices, as follows:
Equation 3. Per-vertex deformations of the underlying template mesh with N vertices.

Here a_{geo} ∈ R^(3N) stores the average facial geometry. The geometry bases b_k were computed by applying principal component analysis (PCA) to 200 high-quality face scans, and the expression bases were obtained in the same manner from blendshapes.
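The affine model above can be sketched as follows. The vertex count and the random matrices standing in for the real PCA and blendshape bases are assumptions for the demo; the basis counts are illustrative, since this section does not state them:

```python
import numpy as np

# Minimal sketch of the affine face model: a mean shape plus linear
# combinations of identity (PCA) and expression (blendshape) bases.
# Basis counts and the random stand-in bases are illustrative assumptions.

N = 1000                        # number of template-mesh vertices (assumed)
rng = np.random.default_rng(0)

a_geo  = rng.standard_normal(3 * N)         # average facial geometry, R^{3N}
B_id   = rng.standard_normal((3 * N, 80))   # identity basis (stand-in for PCA of scans)
B_expr = rng.standard_normal((3 * N, 64))   # expression basis (stand-in for blendshapes)

def face_geometry(alpha, delta):
    """v = a_geo + B_id @ alpha + B_expr @ delta, stacked per-vertex positions."""
    return a_geo + B_id @ alpha + B_expr @ delta

alpha = np.zeros(80)   # identity coefficients (zero = mean face)
delta = np.zeros(64)   # expression coefficients (zero = neutral expression)
v = face_geometry(alpha, delta)
assert np.allclose(v, a_geo)   # zero coefficients recover the average geometry
```

The key property is linearity: fitting reduces to estimating the low-dimensional coefficient vectors alpha and delta rather than 3N free vertex positions.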
2.2.2 Image Formation Model
To render synthetic head images, a full perspective camera is assumed that maps model-space 3D points v via camera space to 2D points on the image plane. The perspective mapping Π contains the multiplication with the camera intrinsics and the perspective division.
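The perspective mapping Π can be illustrated with a minimal sketch; the intrinsic-matrix values below are arbitrary assumptions, not calibration data from the paper:

```python
import numpy as np

# Sketch of the full-perspective mapping: a camera-space 3D point is
# multiplied by the camera intrinsics and then divided by its depth.
# Focal length and principal point below are illustrative assumptions.

K = np.array([[800.0,   0.0, 320.0],   # fx, skew, cx
              [  0.0, 800.0, 240.0],   # fy, cy
              [  0.0,   0.0,   1.0]])

def project(v_cam):
    """Map a camera-space 3D point to 2D image-plane coordinates."""
    p = K @ v_cam          # multiplication with the camera intrinsics
    return p[:2] / p[2]    # perspective division

uv = project(np.array([0.0, 0.0, 2.0]))   # a point on the optical axis
assert np.allclose(uv, [320.0, 240.0])    # lands on the principal point
```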
In addition, based on a distant illumination assumption, spherical harmonics basis functions are used to approximate the incoming radiance B from the environment.
Equation 4. Spherical harmonics basis functions are used to approximate the incoming radiance B from the environment.

Here B is the number of spherical harmonics bands, γ_b are the spherical harmonics coefficients, and r_i and n_i are the reflectance and unit normal vector of the i-th vertex, respectively.
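A minimal sketch of this illumination model, using the standard first three bands of real spherical harmonics (nine basis functions); the coefficient and reflectance values are illustrative assumptions:

```python
import numpy as np

# Sketch of the distant-illumination model: radiance at vertex i is its
# reflectance r_i times a sum of SH basis functions y_b evaluated at the
# unit normal n_i, weighted by lighting coefficients gamma_b.

def sh_basis(n):
    """First 9 real spherical harmonics basis functions (bands 0-2) at unit normal n."""
    x, y, z = n
    return np.array([
        0.282095,                                   # band 0 (constant)
        0.488603 * y, 0.488603 * z, 0.488603 * x,   # band 1 (linear)
        1.092548 * x * y, 1.092548 * y * z,         # band 2 (quadratic)
        0.315392 * (3 * z * z - 1),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ])

def radiance(reflectance, normal, gamma):
    """B_i = r_i * sum_b gamma_b * y_b(n_i)."""
    return reflectance * (sh_basis(normal) @ gamma)

gamma = np.zeros(9)
gamma[0] = 1.0                                    # constant (ambient-only) lighting
b = radiance(0.8, np.array([0.0, 0.0, 1.0]), gamma)
assert np.isclose(b, 0.8 * 0.282095)              # only the constant band contributes
```

Three bands suffice here because, under the distant and diffuse lighting assumption, irradiance is a very smooth function of the surface normal.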
2.3 Synthetic Conditioning Input
Using the face reconstruction approach described above, a face is reconstructed in each frame of the source and target video. Next, the rigid head pose, expression, and eye gaze of the target actor is modified. All parameters are copied in a relative manner from the source to the target.
Then the authors render synthetic conditioning images of the target actor’s face model under the modified parameters using hardware rasterization.
For each frame, three different conditioning inputs are generated: a color rendering, a correspondence image, and an eye gaze image.
Figure 3. The synthetic input used for conditioning the rendering-to-video translation network: (1) colored face rendering under target illumination, (2) correspondence image, and (3) the eye gaze image.

The color rendering shows the modified target actor model under the estimated target illumination, while keeping the target identity (geometry and skin reflectance) fixed. This image provides a good starting point for the subsequent rendering-to-video translation, since in the face region only the delta to a real image has to be learned.
A correspondence image encoding the index of the parametric model’s vertex that projects into each pixel is also rendered to keep the 3D information.
Finally, a gaze map supplies information about the eye gaze direction and blinking.
All of the images are stacked to obtain the input to the rendering-to-video translation network.
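The stacking step amounts to a channel-wise concatenation. The resolution and the three-channel encoding of each conditioning image are assumptions for the demo, and the paper's additional space-time window of past frames is omitted here:

```python
import numpy as np

# Sketch of assembling the conditioning input: per frame, the color
# rendering, correspondence image, and eye-gaze image are stacked along
# the channel axis. Resolution and per-image channel counts are assumed.

H, W = 256, 256
color_render   = np.zeros((H, W, 3), dtype=np.float32)  # shaded target model
correspondence = np.zeros((H, W, 3), dtype=np.float32)  # per-pixel vertex-index encoding
gaze_map       = np.zeros((H, W, 3), dtype=np.float32)  # gaze direction and blink info

def stack_conditioning(*images):
    """Concatenate conditioning images channel-wise into one network input."""
    return np.concatenate(images, axis=-1)

x = stack_conditioning(color_render, correspondence, gaze_map)
assert x.shape == (256, 256, 9)   # three 3-channel images stacked
```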
2.4 Rendering-to-Video Translation
The generated conditioning space-time stacked images are the input to the rendering-to-video translation network.
The network learns to convert the synthetic input into full frames of a photo-realistic target video, in which the target actor now mimics the head motion, facial expression, and eye gaze of the synthetic input.
The network learns to synthesize the entire actor in the foreground: not only the face, for which conditioning input exists, but also all other parts of the actor, such as hair and body, so that they comply with the target head pose.
It also synthesizes the appropriately modified and filled-in background, even including some consistent lighting effects between the foreground and background.
The network shown in Figure 4 follows an encoder-decoder architecture and is trained in an adversarial manner.
Figure 4. The architecture of the rendering-to-video translation network follows an encoder-decoder design.

The training objective function comprises a conditioned adversarial loss and an L1 photometric loss.
Equation 5. The rendering-to-video translation objective function.

During adversarial training, the discriminator D tries to get better at classifying given images as real or synthetic, while the transformation network T tries to improve at fooling the discriminator. The L1 loss penalizes the distance between the synthesized image T(x) and the ground-truth image Y, which encourages sharpness in the synthesized output:
Equation 6. The L1 loss.
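The combined objective of Equations 5 and 6 can be sketched with plain NumPy stand-ins; the discriminator outputs and the L1 weight below are illustrative, not values reported in the paper:

```python
import numpy as np

# Sketch of the training objective: a conditioned adversarial loss plus a
# weighted L1 photometric term between T(x) and the ground-truth frame Y.
# Scalar discriminator outputs and lambda_l1 are illustrative assumptions.

def l1_loss(t_x, y):
    """L1 photometric loss; encourages sharp synthesized output."""
    return np.abs(t_x - y).mean()

def adversarial_loss(d_real, d_fake, eps=1e-8):
    """Conditioned GAN term: D scores real images high, synthetic images low."""
    return np.log(d_real + eps) + np.log(1.0 - d_fake + eps)

def total_objective(t_x, y, d_real, d_fake, lambda_l1=100.0):
    """Maximized over D, minimized over T (with the sign of the GAN term flipped)."""
    return adversarial_loss(d_real, d_fake) + lambda_l1 * l1_loss(t_x, y)

y   = np.ones((4, 4, 3))          # toy ground-truth frame
t_x = np.ones((4, 4, 3)) * 0.5    # toy translated frame
loss = total_objective(t_x, y, d_real=0.9, d_fake=0.1)
assert l1_loss(t_x, y) == 0.5
```

The L1 term anchors the output to the ground truth pixel-wise, while the adversarial term pushes the distribution of outputs toward photo-realism.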
3. Experiments & Results
This approach enables us to take full control of the rigid head pose, facial expression, and eye motion of a target actor in a video portrait, thus opening up a wide range of video rewrite applications.
3.1 Reenactment Under Full Head Control
This approach is the first that can photo-realistically transfer the full 3D head pose (spatial position and rotation), facial expression, as well as eye gaze and eye blinking of a captured source actor to a target actor video.
Figure 5 shows some examples of full-head reenactment between different source and target actors. Here, the authors use the full target video for training and the source video as the driving sequence.
As can be seen, the output of their approach achieves a high level of realism and faithfully mimics the driving sequence, while still retaining the mannerisms of the original target actor.
Figure 5. Qualitative results of full-head reenactment.

3.2 Facial Reenactment and Video Dubbing
Besides full-head reenactment, the approach also enables facial reenactment. In this experiment, the authors replaced the expression coefficients of the target actor with those of the source actor before synthesizing the conditioning input to the rendering-to-video translation network.
Here, the head pose and position and eye gaze remain unchanged. Figure 6 shows facial reenactment results.
Figure 6. Facial reenactment results.

Video dubbing could also be applied by modifying the facial motion of actors who originally spoke in another language to match a translation spoken by a professional dubbing actor in a dubbing studio.
More precisely, the captured facial expressions of the dubbing actor could be transferred to the target actor, while leaving the original target gaze and eye blinks intact.
4. Discussion
In this post, I presented Deep Video Portraits, a novel approach that enables photo-realistic re-animation of portrait videos using only an input video.
In contrast to existing approaches that are restricted to manipulations of facial expressions only, the authors are the first to transfer the full 3D head position, head rotation, face expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor.
The authors have shown, through experiments and a user study, that their method outperforms prior work, both in terms of model performance and expanded capabilities. This opens doors to many applications, like video reenactment for virtual reality and telepresence, interactive video editing, and visual dubbing.
5. Conclusions
As always, if you have any questions or comments feel free to leave your feedback below or you can always reach me on LinkedIn.
Till then, see you in the next post! 😄
For the enthusiastic reader: for more details on “Deep Video Portraits”, check out the official project page or their video demo.
Translated from: https://heartbeat.fritz.ai/deep-video-portraits-f0f4a136546a