Let There Be Sound: Reconstructing High Quality Speech from Silent Videos

Korea Advanced Institute of Science and Technology

Abstract

The goal of this work is to reconstruct high-quality speech from lip motions alone, a task also known as lip-to-speech. A key challenge of lip-to-speech systems is the one-to-many mapping caused by (1) the existence of homophenes and (2) multiple speech variations, which results in mispronounced and over-smoothed speech. In this paper, we propose a novel lip-to-speech system that significantly improves generation quality by alleviating the one-to-many mapping problem from multiple perspectives. Specifically, we incorporate (1) self-supervised speech representations to disambiguate homophenes and (2) acoustic variance information to model diverse speech styles. Additionally, to better solve the aforementioned problem, we employ a flow-based post-net that captures and refines the details of the generated speech. We perform extensive experiments on two datasets and demonstrate that our method achieves generation quality close to that of real human utterances, outperforming existing methods in speech naturalness and intelligibility by a large margin.
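To give an intuition for the homophene problem the abstract describes, the toy sketch below (not the paper's code) shows why lip reading is one-to-many: several phonemes share an identical lip shape (a viseme), so a lip-only model needs additional context to pick the right one. Here a hand-written bigram prior over phonemes stands in for the self-supervised speech representations used in the paper; the viseme classes and scores are illustrative assumptions only.

```python
# Toy viseme classes (hypothetical; real viseme inventories differ).
# Phonemes within one class look identical on the lips.
VISEME_TO_PHONEMES = {
    "bilabial": ["p", "b", "m"],   # full lip closure: p/b/m are homophenes
    "labiodental": ["f", "v"],     # upper teeth on lower lip
    "open": ["a"],
}

# Toy bigram prior: plausibility of phoneme q following phoneme p.
# In the paper this role is played by learned speech representations.
BIGRAM = {
    ("a", "p"): 0.2, ("a", "b"): 0.7, ("a", "m"): 0.1,
    ("p", "a"): 1.0, ("b", "a"): 1.0, ("m", "a"): 1.0,
}

def disambiguate(visemes):
    """Greedily pick, for each viseme, the candidate phoneme that the
    prior scores highest given the previously chosen phoneme."""
    out, prev = [], None
    for v in visemes:
        candidates = VISEME_TO_PHONEMES[v]
        if prev is None:
            out.append(candidates[0])
        else:
            out.append(max(candidates, key=lambda q: BIGRAM.get((prev, q), 0.0)))
        prev = out[-1]
    return out

# The bilabial viseme alone is ambiguous between p/b/m; the prior
# resolves it from context.
print(disambiguate(["open", "bilabial", "open"]))  # → ['a', 'b', 'a']
```

Without the contextual prior, any of "apa", "aba", or "ama" would be an equally valid reading of the same lip sequence, which is exactly the ambiguity the proposed system addresses.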

Samples of GRID Dataset

Sample 1

Transcription: "Place Red by I Eight Now."

Sample 2

Transcription: "Bin Blue in V Four Now."

Sample 3

Transcription: "Bin Green at Y Five Soon."

Samples of Lip2Wav Dataset

Sample 1

Transcription: “So we understand light as behaving like a wave like an ocean with a wavelength and a speed but also like a particle”

Sample 2

Transcription: “That is plotted here in the dotted yellow line. Then I'm gonna dilute by a factor of 10”

Sample 3

Transcription: “Captures on E6 and it was in this position now that Viswanathan Anand resigned the game and yet another victory for Magnus Carlsen”

Ablation Study in Lip2Wav Dataset

Sample 1

Transcription: “That will attract into the charged surface, so let's look at that. What I have here”

Sample 2

Transcription: “Defending, and now rook to D8 preparing to counter white rooks along the D file. Now bishop to G6”

Sample 3

Transcription: “Still it's a rook against the bishop and it will be very interesting to see how Mamedyarov does it King E1 with C5”

Unseen-speaker Experiment on GRID Dataset

In this experiment, we randomly select 15, 8, and 10 speakers for training, validation, and testing, respectively.

Sample 1

Transcription: “Bin Blue by Y Zero Again”

Sample 2

Transcription: “Bin Blue at V Nine Soon”

Sample 3

Transcription: “Place Blue by M Eight Again”