That's What I Said:
Fully-Controllable Talking Face Generation

1 Korea Advanced Institute of Science and Technology
2 Hyundai Motor Company
3 42dot Inc.

Our novel talking face generation framework precisely reflects every facial expression of the motion source while synchronising the lip shape with the input audio source. The key to our framework is to find the canonical space, where every face has the same motion patterns but different identities.

Abstract

The goal of this paper is to synthesise talking faces with controllable facial motions by manipulating the latent space of a face generator. To achieve this goal, we propose two key ideas. The first is to establish a canonical space in which every face shares the same motion patterns but has a different identity. The second is to navigate a multimodal motion space that represents only motion-related features while eliminating identity information. To disentangle identity and motion, we impose an orthogonality constraint between the two latent spaces. As a result, our method generates natural-looking talking faces with fully controllable facial attributes and accurate lip synchronisation. Extensive experiments demonstrate that our method achieves state-of-the-art results in both visual quality and lip-sync score. To the best of our knowledge, ours is the first talking face generation framework that can accurately reflect target facial motions, including the lips, head pose, eyes, and even expressions, in the generated video all at once, without any additional supervision beyond RGB video with audio.
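To make the orthogonality constraint concrete, here is a minimal PyTorch-style sketch of one common way such a constraint can be imposed between paired identity and motion codes. The function name and the cosine-similarity formulation are illustrative assumptions on our part, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(identity_codes: torch.Tensor, motion_codes: torch.Tensor) -> torch.Tensor:
    """Illustrative orthogonality penalty between identity and motion codes.

    Both inputs are (batch, dim) latent codes. Driving the cosine similarity
    of each pair towards zero discourages identity information from leaking
    into the motion representation (an assumed formulation, not the paper's).
    """
    id_n = F.normalize(identity_codes, dim=-1)
    mo_n = F.normalize(motion_codes, dim=-1)
    # Absolute cosine similarity per sample, averaged over the batch.
    return (id_n * mo_n).sum(dim=-1).abs().mean()
```

In practice such a term would simply be added, with a small weight, to the reconstruction and lip-sync objectives during training.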

Overview

Our framework hinges on two mapping operations: 1) Visual Space to Canonical Space, and 2) Visual/Audio Space to Multimodal Motion Space. The first mapping yields canonical images that share the same motion features but differ in identity, while the second mapping produces motion vectors that let us transfer desired motions onto the canonical images. To ensure the two subspaces remain disentangled, we impose an orthogonality constraint between them. Based on this process, our model generates talking faces that mimic the facial motion of the target.
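The sketch below illustrates how the two mappings could be wired together; every module name (to_canonical, visual_motion_enc, audio_motion_enc, generator) and the additive fusion of visual and audio motion codes are hypothetical placeholders rather than the authors' actual architecture.

```python
import torch.nn as nn

class TalkingFacePipeline(nn.Module):
    """Illustrative composition of the two mapping operations described above."""

    def __init__(self, to_canonical, visual_motion_enc, audio_motion_enc, generator):
        super().__init__()
        self.to_canonical = to_canonical            # visual space -> canonical space
        self.visual_motion_enc = visual_motion_enc  # visual space -> motion space
        self.audio_motion_enc = audio_motion_enc    # audio space  -> motion space
        self.generator = generator                  # canonical + motion codes -> frame

    def forward(self, source_frame, driving_frame, driving_audio):
        # 1) Map the source face to the canonical space (identity only, shared motion).
        canonical_code = self.to_canonical(source_frame)
        # 2) Map the driving signals to the shared multimodal motion space.
        motion_code = self.visual_motion_enc(driving_frame) + self.audio_motion_enc(driving_audio)
        # 3) Transfer the desired motion onto the canonical identity.
        return self.generator(canonical_code, motion_code)
```

The additive fusion here is only one possible way to combine visual and audio motion cues; the essential property is that both modalities map into a single shared motion space.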

Same-Identity Reconstruction

Cross-Identity Generation

Samples in Canonical Space

We demonstrate how well our model preserves identity by mapping various identities into the canonical space. In Fig. (a), we generate diverse canonical image samples with different identities by feeding each canonical code to our generator. In Fig. (b), we further visualise every canonical image from a single video. These results show that our model robustly maintains the source identities and generalises well across a variety of identities.

Comparison with SOTA