VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis

Jaemin Jung^*, Junseok Ahn^*, Chaeyoung Jung, Tan Dat Nguyen, Youngjoon Jang, Joon Son Chung

Multimodal AI Lab, KAIST

^*Equal Contribution

Model architecture of VoiceDiT

Figure 1: VoiceDiT consists of a TTS module and a Dual-DiT model. A cross-attention module is integrated into each DiT block to inject environmental conditions.
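The caption's idea of injecting environmental conditions through cross-attention can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names, the single attention head, the residual connection, and the projection matrices `w_q`, `w_k`, `w_v` are all assumptions made for illustration. It only shows the general mechanism, in which latent audio tokens act as queries and condition tokens supply keys and values.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention with a residual connection.

    latent_tokens: audio latent vectors (queries).
    cond_tokens:   environment condition vectors (keys and values).
    The value dimension must equal the latent dimension so the
    attended vector can be added back to the latent token.
    """
    keys = [matvec(w_k, c) for c in cond_tokens]
    values = [matvec(w_v, c) for c in cond_tokens]
    scale = math.sqrt(len(keys[0]))

    outputs = []
    for token in latent_tokens:
        query = matvec(w_q, token)
        # Scaled dot-product score between this query and every key.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended = [
            sum(w * value[i] for w, value in zip(weights, values))
            for i in range(len(values[0]))
        ]
        # Residual: the condition is added to the latent token, not substituted.
        outputs.append([t + a for t, a in zip(token, attended)])
    return outputs
```

Calling `cross_attention` with identity projections and a single condition token simply adds that condition vector to each latent token, which makes the "injection" behaviour easy to inspect by hand.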

Abstract

We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible speech, achieving this alignment in noisy conditions remains a significant and underexplored challenge in the field.

To address this, we present a novel audio generation pipeline named VoiceDiT. This pipeline includes three key components: (1) the creation of a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning, (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions, and (3) a diffusion-based Image-to-Audio Translator that allows the model to bridge the gap between audio and image, facilitating the generation of environmental sound that aligns with the multi-modal prompts.

Extensive experimental results demonstrate that VoiceDiT outperforms previous models on real-world datasets, with significant improvements in both audio quality and modality integration.

Overall pipeline of VoiceDiT

Figure 2: To obtain high-quality data, the data is generated by synthesizing clean speech with extracted sound and is then filtered based on WER scores and speech alignment. VoiceDiT is trained to generate environment-aware speech using a CLAP encoder, which encodes the environment prompt. During inference, VoiceDiT can generate audio that adheres to both the environmental and content text prompts, using an I2A-Translator to incorporate the image modality.
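As a rough illustration of the WER-based filtering step mentioned in the caption, the sketch below computes word error rate with a standard word-level edit distance and keeps only samples under a threshold. The sample field names (`reference`, `asr_transcript`) and the `max_wer=0.2` default are hypothetical; the page does not state which ASR system, threshold, or alignment check the authors used.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def filter_by_wer(samples, max_wer=0.2):
    """Keep samples whose ASR transcript is close enough to the reference.

    Each sample is a dict with hypothetical keys "reference" and
    "asr_transcript"; max_wer is an assumed threshold.
    """
    return [
        s for s in samples
        if word_error_rate(s["reference"], s["asr_transcript"]) <= max_wer
    ]
```

In practice the transcripts would come from running a speech recognizer on the generated or real-world audio; that step is outside this sketch.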

Environment-Aware Speech Synthesis

Description provides the environmental context of the audio, and Content conveys the linguistic information.

Description: Nature environmental noise with various bird vocalization. Content: In nature, nothing is perfect and everything is perfect.	Description: A racing cars are passing by and disappear. Content: The winning is for all the kids out there who dream the impossible!	Description: Raining heavily with thunder while person is speaking. Content: It's raining heavily today.

Description: Pop music that upbeat, catchy, and easy to listen, high fidelity, with simple melodies, electronic instruments and polished production. Content: This audio is generated by a text to speech model.	Description: Ocean waves crashing as wind heavily blows. Content: This audio is generated by a text to speech model.	Description: Battlefield scene, continuous roar of artillery and gunfire, the sharp crack of bullets, the thundering explosions of bombs, and the screams of wounded soldiers. Content: This audio is generated by a text to speech model.

AC-Filtered test set

Description provides the environmental context of the audio, and Content conveys the linguistic information.

Please pay attention to the words highlighted in red in the content text to facilitate a comparison of speech quality.

Ground Truth	VoiceDiT (Ours)	VoiceLDM
Description: Man talking with cranking noises in the background Content: Cranking noises Cranking see you know I've taken the parts out I

Description: Beeping sound with person talking in the background Content: That isn't wired up by the way. Let's get this one here too. That's it.

Description: Women, child talking and birds chirping Content: Come on. Yeah. Come on, let's go. Daddy's waiting.

Description: A man talking as water streams in the background Content: Here we have a Britain Stratton 5 horsepower horse drove L4 motor and it is on the back of this 17 foot

Description: A woman is speaking while food being fried is sizzling Content: Adding our drumsticks. Let's just...

Description: A man speaking followed by another man speaking in the background as a motorcycle engine runs idle Content: Hi, I'm Lyle Doverspike from Hope, Pennsylvania.

Description: Female speech, a toilet flushing and then more speech Content: Well see you next Tuesday

Description: A man is talking, and food is frying Content: with anything else you have to season your vegetables so I'm going to take some black pepper and if any of you that watch me know that I

Description: A man is speaking followed by gunfire Content: AA, west of your location. Objective Alpha has just been neutralized.

Description: Rain falling with distant humming and a man speaking Content: Little bit of bad weather.

Description: Waves roll in and a man speaks Content: No, it's still there. I think it's huge.

(Zero-Shot) Text-to-Audio

Even though VoiceDiT is trained only on speech-infused audio, it can also perform text-to-audio generation that contains no speech.

VoiceDiT (Ours)	VoiceLDM
Description: Birds chirping.

Description: A crowd of people shout and give applause

Description: A violin playing a heartfelt melody.

(Zero-Shot) Image-to-Audio

Our I2A-Translator bridges the gap between audio and images, enabling the model to synthesize high-fidelity, visually-relevant sounds based on images.


[Image prompts and audio samples: six images are compared across SpecVQGAN, Im2Wav, and VoiceDiT (Ours), followed by three images with VoiceDiT (Ours) only.]
Text-to-Speech

"clean speech" is given as the description prompt.

VoiceDiT (Ours)	VoiceLDM
Content: And lay me down in my cold bed and leave my shining lot.

Content: Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.

Content: The army found the people in poverty and left them in comparative wealth.

Content: Thus did this humane and right minded father comfort his unhappy daughter, and her mother embracing her again, did all she could to soothe her feelings.

BibTeX

@article{jung24voicedit,
  author    = {Jaemin Jung and Junseok Ahn and Chaeyoung Jung and Tan Dat Nguyen and Youngjoon Jang and Joon Son Chung},
  title     = {VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis},
  journal   = {arXiv},
  year      = {2024},
}