V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow

1KAIST, 2Microsoft

Abstract

In this paper, we introduce V2SFlow, a novel Video-to-Speech (V2S) framework designed to generate natural and intelligible speech directly from silent talking face videos. While recent V2S systems have shown promising results on constrained datasets with limited speakers and vocabularies, their performance often degrades on real-world, unconstrained datasets due to the inherent variability and complexity of speech signals. To address these challenges, we decompose the speech signal into manageable subspaces (content, pitch, and speaker information), each representing distinct speech attributes, and predict them directly from the visual input. To generate coherent and realistic speech from these predicted attributes, we employ a rectified flow matching decoder built on a transformer architecture, which models efficient probabilistic pathways from random noise to the target speech distribution. Extensive experiments demonstrate that V2SFlow significantly outperforms state-of-the-art methods, even surpassing the naturalness of ground-truth utterances.
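The rectified flow idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the V2SFlow decoder: the paper's decoder is a transformer that works from the predicted content, pitch, and speaker attributes, whereas this sketch uses a hand-written toy velocity field in its place. All names here (`interpolate`, `target_velocity`, `euler_sample`, `toy_velocity`, `TOY_TARGET`) are hypothetical and chosen only for illustration. The sketch shows the straight-line path between noise and data, the constant velocity that a model would be trained to regress, and fixed-step Euler integration from noise toward the target.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path between noise x0 and data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field: constant along the straight path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(velocity_fn, x0, num_steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Assumption: a toy stand-in for a trained network. When the data distribution
# is a single point, the ideal velocity field is (target - x) / (1 - t).
TOY_TARGET = [0.5, -1.0, 2.0]

def toy_velocity(x, t):
    return [(b - xi) / (1.0 - t) for xi, b in zip(x, TOY_TARGET)]

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in TOY_TARGET]
sample = euler_sample(toy_velocity, noise, num_steps=8)
```

In a real system, `toy_velocity` would be replaced by a learned network evaluated on the current state, the time step, and the conditioning features. The number of Euler steps then trades synthesis speed against fidelity.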

Comparison results on the LRS3 dataset

Columns: Silent video | Ground Truth | SVTS | Intelligible | V2SFlow-A | DiffV2S | LTBS | V2SFlow-V | Text

but you know what

so the answer to the second question can we change

and that's why it's been a pleasure speaking to you

originally the sample was aged 18 to 94

they were wonderful people

we were making what was invisible visible

and when we're unprepared we overreact

you destroy all the bone marrow in the cancer patient with massive doses of chemotherapy

my spine curves spiral

it turns out the evidence says otherwise

you see there's a lot of things we just don't have data on

and the soldier on the front tank said we have unconditional orders to destroy this

to me it's not that clear

it's just not possible

we were wrong

we had drawn a blank and it wasn't just iraq and afghanistan

and they definitely have been shown to be effective in some cases

this is not a photo profile for your facebook

who is going to inherit his name and his fortune

all of you are members of tribes