We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible speech, achieving this alignment in noisy conditions remains a significant and underexplored challenge in the field.
To address this, we propose an audio generation pipeline with three key components: (1) a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning; (2) Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions; and (3) a diffusion-based Image-to-Audio Translator that bridges the gap between audio and image, enabling the generation of environmental sound that aligns with multi-modal prompts.
Extensive experimental results demonstrate that VoiceDiT outperforms previous models on real-world datasets, showcasing significant improvements in both audio quality and modality integration.
Figure 2: To obtain high-quality data, training samples are synthesized from clean speech and extracted sound, then filtered by WER score and speech alignment. VoiceDiT is trained to generate environment-aware speech using a CLAP encoder, which encodes the environment prompt. During inference, VoiceDiT generates audio that adheres to both the environmental and content text prompts, and uses an I2A-Translator to incorporate the image modality.
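The WER-based filtering step mentioned in the caption can be illustrated with a minimal sketch. It assumes each synthesized sample comes with its reference text and a transcript from some speech recognizer. The function names, the `(sample_id, reference, transcript)` tuple layout, and the `max_wer` threshold of 0.1 are hypothetical choices for illustration, not values taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def filter_by_wer(samples, max_wer=0.1):
    """Keep the ids of samples whose transcript is close to the reference.

    `samples` is an iterable of (sample_id, reference, transcript) tuples;
    `max_wer` is an illustrative threshold, not the one used by the authors.
    """
    return [
        sample_id
        for sample_id, reference, transcript in samples
        if word_error_rate(reference, transcript) <= max_wer
    ]
```

A sample whose recognized transcript drifts too far from its reference text is dropped, which is one simple way to keep only mixtures in which the speech remains intelligible.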
Description provides the environmental context of the audio, and Content conveys the linguistic information.
Description: Nature environmental noise with various bird vocalization. Content: In nature, nothing is perfect and everything is perfect.
Description: A racing cars are passing by and disappear. Content: The winning is for all the kids out there who dream the impossible!
Description: Raining heavily with thunder while person is speaking. Content: It's raining heavily today.
Description: Pop music that upbeat, catchy, and easy to listen, high fidelity, with simple melodies, electronic instruments and polished production. Content: This audio is generated by a text to speech model.
Description: Ocean waves crashing as wind heavily blows. Content: This audio is generated by a text to speech model.
Description: Battlefield scene, continuous roar of artillery and gunfire, the sharp crack of bullets, the thundering explosions of bombs, and the screams of wounded soldiers. Content: This audio is generated by a text to speech model.
Description provides the environmental context of the audio, and Content conveys the linguistic information.
Please pay attention to the words highlighted in red in the content text to facilitate a comparison of speech quality.
Ground Truth | VoiceDiT (Ours) | VoiceLDM
Description: Man talking with cranking noises in the background. Content: Cranking noises Cranking see you know I've taken the parts out I
Description: Beeping sound with person talking in the background. Content: That isn't wired up by the way. Let's get this one here too. That's it.
Description: Women, child talking and birds chirping. Content: Come on. Yeah. Come on, let's go. Daddy's waiting.
Description: A man talking as water streams in the background. Content: Here we have a Britain Stratton 5 horsepower horse drove L4 motor and it is on the back of this 17 foot
Description: A woman is speaking while food being fried is sizzling. Content: Adding our drumsticks. Let's just...
Description: A man speaking followed by another man speaking in the background as a motorcycle engine runs idle. Content: Hi, I'm Lyle Doverspike from Hope, Pennsylvania.
Description: Female speech, a toilet flushing and then more speech. Content: Well see you next Tuesday
Description: A man is talking, and food is frying. Content: with anything else you have to season your vegetables so I'm going to take some black pepper and if any of you that watch me know that I
Description: A man is speaking followed by gunfire. Content: AA, west of your location. Objective Alpha has just been neutralized.
Description: Rain falling with distant humming and a man speaking. Content: Little bit of bad weather.
Description: Waves roll in and a man speaks. Content: No, it's still there. I think it's huge.
VoiceDiT (Ours) | VoiceLDM
Description: Birds chirping.
Description: A crowd of people shout and give applause
Description: A violin playing a heartfelt melody.
[Samples: six comparisons of SpecVQGAN, Im2Wav, and VoiceDiT (Ours), followed by three VoiceDiT (Ours)-only samples; the embedded media is not recoverable.]
VoiceDiT (Ours) | VoiceLDM
Content: And lay me down in my cold bed and leave my shining lot.
Content: Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.
Content: The army found the people in poverty and left them in comparative wealth.
Content: Thus did this humane and right minded father comfort his unhappy daughter, and her mother embracing her again, did all she could to soothe her feelings.
@article{jung24voicedit,
author = {Jaemin Jung and Junseok Ahn and Chaeyoung Jung and Tan Dat Nguyen and Youngjoon Jang and Joon Son Chung},
title = {VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis},
journal = {arXiv},
year = {2024},
}