From Faces to Voices:
Learning Hierarchical Representations for High-quality Video-to-Speech

Ji-Hoon Kim, Jeongsoo Choi, Jaehun Kim, Chaeyoung Jung, Joon Son Chung

Korea Advanced Institute of Science and Technology, Republic of Korea

Abstract

The objective of this study is to generate high-quality speech from silent talking face videos, a task also known as video-to-speech synthesis. A significant challenge in video-to-speech synthesis lies in the substantial modality gap between silent video and multi-faceted speech. In this paper, we propose a novel video-to-speech system that effectively bridges this modality gap, significantly enhancing the quality of synthesized speech. This is achieved by learning hierarchical representations from video to speech. Specifically, we gradually transform silent video into acoustic feature spaces through three sequential stages: content, timbre, and prosody modeling. In each stage, we align a visual factor (lip movements, face identity, and facial expressions, respectively) with its acoustic counterpart to ensure a seamless transformation. Additionally, to generate realistic and coherent speech from the visual representations, we employ a flow matching model that estimates direct trajectories from a simple prior distribution to the target speech distribution. Extensive experiments demonstrate that our method achieves generation quality comparable to real utterances, outperforming existing methods by a significant margin.
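To make the flow matching idea in the abstract concrete, the following is a minimal sketch of the generic straight-path (conditional) flow matching recipe, not the implementation used in this work. It assumes a standard Gaussian prior, a linear interpolation path between a prior sample and a target feature vector, and a fixed-step Euler sampler. The function names (`cfm_training_pair`, `euler_sample`, `ideal_point_velocity`) are hypothetical, and the "velocity network" is replaced by a closed-form field for a single known target so the example runs without training.

```python
import numpy as np


def cfm_training_pair(x0, x1, t):
    """Build one flow matching training example on a straight path.

    x0: samples from the simple prior (assumed Gaussian here).
    x1: target features (e.g. acoustic frames); same shape as x0.
    t:  times in [0, 1], one per batch element (or a scalar).
    Returns the interpolated point x_t and the regression target u_t,
    which for a straight path is the constant velocity x1 - x0.
    """
    t = np.asarray(t, dtype=float).reshape(-1, *([1] * (x1.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1
    u_t = x1 - x0
    return x_t, u_t


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


def ideal_point_velocity(x1):
    """Stand-in for a trained network (assumption for illustration only).

    For a single known target x1, the field (x1 - x) / (1 - t) moves any
    point along a straight line that reaches x1 at t = 1.
    """
    return lambda x, t: (x1 - x) / (1.0 - t)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(4, 80))   # toy "acoustic features"
    prior = rng.normal(size=(4, 80))    # sample from the simple prior
    generated = euler_sample(ideal_point_velocity(target), prior, num_steps=10)
    print("max abs error:", np.abs(generated - target).max())
```

In a real system, the closed-form field would be replaced by a neural network trained to regress `u_t` from `x_t`, `t`, and the visual conditioning; the number of Euler steps then trades sampling cost against accuracy.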

Results from LRS3-TED Dataset

Columns: Silent Video | Ground Truth | SVTS | Intelligible | LTBS | DiffV2S | Ours (10) | Ours (1000)
    Sample 1: We really don't talk anymore
    Sample 2: How would I know what I was doing differently
    Sample 3: This meant that the movement demanding change knew what they were against crushing poverty widening
    Sample 4: And that's really what I find so intriguing about the reactions that we've had to the

Results from LRS2-BBC Dataset

Columns: Silent Video | Ground Truth | SVTS | Intelligible | LTBS | DiffV2S | Ours (10) | Ours (1000)
    Sample 1: Came up with the perfect name
    Sample 2: As a result of smoking
    Sample 3: If you lived in the building
    Sample 4: And, and for me the surprise was