V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

Jeongsoo Choi*, Ji-Hoon Kim*, Jinyu Li, Joon Son Chung, Shujie Liu

Abstract

In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos. While recent V2S systems have shown promising results on constrained datasets with limited speakers and vocabularies, their performance often degrades on real-world, unconstrained datasets due to the inherent variability and complexity of speech signals. To address these challenges, we decompose the speech signal into manageable subspaces (content, pitch, and speaker information), each representing distinct speech attributes, and predict them directly from the visual input. To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a transformer architecture, which models efficient probabilistic pathways from random noise to the target speech distribution. Extensive experiments demonstrate that V2SFlow significantly outperforms state-of-the-art methods, even surpassing the naturalness of ground-truth utterances.
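To make the idea of a rectified flow pathway from noise to data concrete, here is a minimal NumPy sketch. It is not the V2SFlow decoder: the function names (`interpolate`, `target_velocity`, `euler_sample`) are hypothetical, and the transformer that would predict the velocity from the visual-derived attributes is replaced by a caller-supplied `velocity_fn`. It only illustrates the generic rectified flow recipe: a straight-line path between a noise sample and a target, a constant velocity along that path as the regression target, and a fixed-step Euler integrator for sampling.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t.

    In rectified flow training, a network v(x_t, t) is regressed onto this value.
    """
    return x1 - x0


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps.

    velocity_fn is an assumed stand-in for a trained velocity predictor.
    """
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(4)               # x0: sample from the noise prior
    target = np.array([1.0, -2.0, 0.5, 3.0])     # x1: toy stand-in for a speech feature

    # Oracle velocity for this single pair; a real model would predict it.
    def oracle(x, t):
        return target_velocity(noise, target)

    print(euler_sample(oracle, noise, num_steps=4))
```

Because the ideal velocity field is constant along a straight path, the Euler integrator recovers the target in this toy case even with few steps, which is the intuition behind describing such pathways as efficient.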

Samples from LRS3-TED Dataset

Silent video | Ground Truth | SVTS | Intelligible | V2SFlow-A | DiffV2S | LTBS | V2SFlow-V | Text
but you know what
so the answer to the second question can we change
and that's why it's been a pleasure speaking to you
originally the sample was aged 18 to 94
they were wonderful people
we were making what was invisible visible
and when we're unprepared we overreact
you destroy all the bone marrow in the cancer patient with massive doses of chemotherapy
my spine curves spiral
it turns out the evidence says otherwise
you see there's a lot of things we just don't have data on
and the soldier on the front tank said we have unconditional orders to destroy this
to me it's not that clear
it's just not possible
we were wrong
we had drawn a blank and it wasn't just iraq and afghanistan
and they definitely have been shown to be effective in some cases
this is not a photo profile for your facebook
who is going to inherit his name and his fortune
all of you are members of tribes