Multimodal AI Lab @ KAIST

Publications

2026

LP-CFM: Perceptual Invariance-Aware Conditional Flow Matching for Speech Modeling
D. Kwak, Y. Jang, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS
T. D. Nguyen, J. Kim, J. Kim, S. Choi, Y. Lim, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
MAGE: A Coarse-to-Fine Speech Enhancer with Masked Generative Model
T. H. Pham, T. D. Nguyen, P. T. Tran, J. S. Chung, D. D. Nguyen
International Conference on Acoustics, Speech, and Signal Processing
PDF
Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap
K. Nam, J. Choi, H. Lee, J. Heo, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
LAMB: LLM-Based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence
H. Lee, J. Choi, K. Nam, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
UNMIXX: Untangling Highly Correlated Singing Voices Mixtures
J. Jung, J. Kim, D. Kwak, J. Lee, J. Nam, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
FastAV: Efficient Token Pruning for Audio-Visual Large Language Model Inference
C. Jung, Y. Jang, S. Lee, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

2025

Toward Interactive Sound Source Localization: Better Align Sight and Sound!
A. Senocak, H. Ryu, J. Kim, T. Oh, H. Pfister, J. S. Chung
IEEE Transactions on Pattern Analysis and Machine Intelligence
PDF
SpoofCeleb: Speech Deepfake Detection and SASV In The Wild
J. Jung, Y. Wu, X. Wang, J. Kim, S. Maiti, Y. Matsunaga, H. Shim, J. Tian, N. Evans, J. S. Chung, W. Zhang, S. Um, S. Takamichi, S. Watanabe
IEEE Open Journal of Signal Processing
PDF
CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation
J. Kim, H. Yang, Y. Ju, I. Kim, B. Kim, J. S. Chung
IEEE Transactions on Audio, Speech and Language Processing
PDF
AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding
C. Jung, Y. Jang, J. S. Chung
Conference on Neural Information Processing Systems
PDF
Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation
K. Zhang, T. X. Pham, S. Lee, A. Niu, A. Senocak, J. S. Chung
Conference on Neural Information Processing Systems
PDF
Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision
C. Zhang, K. Zhang, J. S. Chung, I. S. Kweon, J. Kim, C. Mao
Conference on Neural Information Processing Systems
PDF
Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing
J. Choi, J. Kim, J. S. Chung
Findings of Empirical Methods in Natural Language Processing
PDF
AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation
J. Choi, J. Kim, S. Kim, T. Oh, J. S. Chung
ACM International Conference on Multimedia
PDF
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
S. Kim, J. Choi, P. Peng, J. S. Chung, T. Oh, D. Harwath
International Conference on Computer Vision
PDF
MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation
S. Cho, J. Choi, S. Kim, S. Yun
International Conference on Computer Vision
PDF
InfiniteAudio: Infinite-Length Audio Generation with Consistency
C. Jung, H. Ki, J. Kim, J. Kim, J. S. Chung
Interspeech
PDF
SEED: Speaker Embedding Enhancement Diffusion Model
K. Nam, J. Heo, J. Jung, G. Park, C. Jung, H. Yu, J. S. Chung
Interspeech
PDF
Accelerating Diffusion-based Text-to-Speech Model Training with Dual Modality Alignment
J. Choi, Z. Niu, J. Kim, C. Wang, J. S. Chung, X. Chen
Interspeech
PDF
The text-to-speech in the wild (TITW) dataset
J. Jung, W. Zhang, S. Maiti, Y. Wu, X. Wang, J. Kim, Y. Matsunaga, S. Um, J. Tian, H. Shim, N. Evans, J. S. Chung, S. Takamichi, S. Watanabe
Interspeech
PDF
Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes
H. Ryu, S. Kim, J. S. Chung, A. Senocak
IEEE Conference on Computer Vision and Pattern Recognition
PDF
From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech
J. Kim, J. Choi, J. Kim, C. Jung, J. S. Chung
IEEE Conference on Computer Vision and Pattern Recognition
PDF
Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues
Y. Jang, H. Raajesh, L. Momeni, G. Varol, A. Zisserman
IEEE Conference on Computer Vision and Pattern Recognition
PDF
Test-Time Augmentation for Pose-invariant Face Recognition
J. Jung, Y. Jang, J. S. Chung
IEEE International Conference on Automatic Face and Gesture Recognition
PDF
High-Quality Joint Image and Video Compression with Causal VAE
D. M. Argaw, X. Liu, Q. Zhang, J. S. Chung, M. Liu, F. Reda
International Conference on Learning Representations
PDF
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
S. Kim, H. Oh, J. Lee, A. Senocak, J. S. Chung, T. Oh
International Conference on Learning Representations
PDF
ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation
Z. Li, S. Hu, S. Liu, L. Zhou, J. Choi, L. Meng, X. Guo, J. Li, H. Ling, F. Wei
International Conference on Learning Representations
PDF
V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
J. Choi, J. Kim, J. Li, J. S. Chung, S. Liu
International Conference on Acoustics, Speech, and Signal Processing
PDF
LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport
K. Rho, H. Lee, V. Iverson, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis
J. Jung, J. Ahn, C. Jung, T. D. Nguyen, Y. Jang, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding
T. D. Nguyen, J. Kim, J. Choi, S. Choi, J. Park, Y. Lee, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
AdaptVC: High Quality Voice Conversion with Adaptive Learning
J. Kim, J. Kim, Y. Choi, T. D. Nguyen, S. Mun, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

2024

Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
M. H. Erol, A. Senocak, J. Feng, J. S. Chung
IEEE Signal Processing Letters
PDF
Bridging the Gap between Audio and Text using Parallel-attention for User-defined Keyword Spotting
Y. Kim, J. Jung, J. Park, B. Kim, J. S. Chung
IEEE Signal Processing Letters
PDF
Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding
J. Woo, H. Ryu, Y. Jang, J. W. Cho, J. S. Chung
ACM International Conference on Multimedia
PDF
VoxSim: A perceptual voice similarity dataset
J. Ahn, Y. Kim, Y. Choi, D. Kwak, J. Kim, S. Mun, J. S. Chung
Interspeech
PDF
Lightweight Audio Segmentation for Long-form Speech Translation
J. Lee, S. Kim, H. Kim, J. S. Chung
Interspeech
PDF
ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions
J. Feng, M. H. Erol, J. S. Chung, A. Senocak
Interspeech
PDF
To what extent can ASV systems naturally defend against spoofing attacks?
J. Jung, X. Wang, N. Evans, S. Watanabe, H. Shim, H. Tak, S. Arora, J. Yamagishi, J. S. Chung
Interspeech
PDF
Disentangled Representation Learning for Environment-agnostic Speaker Recognition
K. Nam, H. Heo, J. Jung, J. S. Chung
Interspeech
PDF
FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching
C. Jung, S. Lee, J. Kim, J. S. Chung
Interspeech
PDF
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
J. Kim, H. Lee, K. Rho, J. Kim, J. S. Chung
International Conference on Machine Learning
PDF
Faces that Speak: Jointly Synthesising Talking Face and Speech from Text
Y. Jang, J. Kim, J. Ahn, D. Kwak, H. Yang, Y. Ju, I. Kim, B. Kim, J. S. Chung
IEEE Conference on Computer Vision and Pattern Recognition
PDF
Scaling Up Video Summarization Pretraining with Large Language Models
D. M. Argaw, S. Yoon, F. C. Heilbron, H. Deilamsalehy, T. Bui, Z. Wang, F. Dernoncourt, J. S. Chung
IEEE Conference on Computer Vision and Pattern Recognition
PDF
Towards Automated Movie Trailer Generation
D. M. Argaw, M. Soldan, A. Pardo, C. Zhao, F. C. Heilbron, J. S. Chung, B. Ghanem
IEEE Conference on Computer Vision and Pattern Recognition
PDF
FreGrad: Lightweight and fast frequency-aware diffusion vocoder
T. D. Nguyen, J. Kim, Y. Jang, J. Kim, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF Project page
SlowFast Network for Continuous Sign Language Recognition
J. Ahn, Y. Jang, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification
H. Heo, K. Nam, B. Lee, Y. Kwon, M. Lee, Y. J. Kim, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
Speech Guided Masked Image Modeling for Visually Grounded Speech
J. Woo, H. Ryu, A. Senocak, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
VoxMM: Rich Transcription of Conversations in the Wild
D. Kwak, J. Jung, K. Nam, Y. Jang, J. Jung, S. Watanabe, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
From Coarse To Fine: Efficient Training for Audio Spectrogram Transformers
J. Feng, M. H. Erol, J. S. Chung, A. Senocak
International Conference on Acoustics, Speech, and Signal Processing
PDF
VoiceLDM: Text-to-Audio Generation with Linguistic Content
Y. Lee, I. Yeon, J. Nam, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF Project page
TalkNCE: Improving Active Speaker Detection with Talking-Aware Contrastive Learning
C. Jung, S. Lee, K. Nam, K. Rho, Y. J. Kim, Y. Jang, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model
S. Lee, C. Jung, Y. Jang, J. Kim, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
J. Kim, J. Kim, J. S. Chung
AAAI Conference on Artificial Intelligence
PDF Project page
Can CLIP Help Sound Source Localization?
S. Park, A. Senocak, J. S. Chung
Winter Conference on Applications of Computer Vision
PDF

2023

That's What I Said: Fully-Controllable Talking Face Generation
Y. Jang, K. Rho, J. Woo, H. Lee, J. Park, Y. Lim, B. Kim, J. S. Chung
ACM International Conference on Multimedia
PDF Project page
Sound Source Localization is All about Cross-Modal Alignment
A. Senocak, H. Ryu, J. Kim, T. Oh, H. Pfister, J. S. Chung
International Conference on Computer Vision
PDF
FlexiAST: Flexibility is What AST Needs
J. Feng, M. H. Erol, J. S. Chung, A. Senocak
Interspeech
PDF
Disentangled Representation Learning for Multilingual Speaker Recognition
K. Nam, Y. Kim, J. Huh, H. Heo, J. Jung, J. S. Chung
Interspeech
PDF Project page
Curriculum learning for self-supervised speaker verification
H. Heo, J. Jung, J. Kang, Y. Kwon, B. Lee, Y. J. Kim, J. S. Chung
Interspeech
PDF
Self-sufficient framework for continuous sign language recognition
Y. Jang, Y. Oh, J. W. Cho, M. Kim, D. Kim, I. S. Kweon, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF Project page
Metric learning for user-defined keyword spotting
J. Jung, Y. Kim, J. Park, Y. Lim, B. Kim, Y. Jang, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF Project page
Hindi as a second language: improving visually grounded speech with semantically similar samples
H. Ryu, A. Senocak, I. S. Kweon, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
MarginNCE: Robust Sound Localization with a Negative Margin
S. Park, A. Senocak, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
Advancing the dimensionality reduction of speaker embeddings for speaker diarisation: disentangling noise and informing speech activity
Y. J. Kim, H. Heo, J. Jung, Y. Kwon, B. Lee, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
In search of strong embedding extractors for speaker diarisation
J. Jung, B. Lee, J. Huh, A. Brown, Y. Kwon, S. Watanabe, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech
J. Lee, J. S. Chung, S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF

2022

Signing Outside the Studio: Benchmarking Background Robustness for Continuous Sign Language Recognition
Y. Jang, Y. Oh, J. W. Cho, D. Kim, J. S. Chung, I. S. Kweon
British Machine Vision Conference
PDF Project page
Augmentation adversarial training for self-supervised speaker representation learning
J. Kang, J. Huh, H. Heo, J. S. Chung
IEEE Journal of Selected Topics in Signal Processing
PDF
Pushing the limits of raw waveform speaker recognition
J. Jung, Y. J. Kim, H. Heo, B. Lee, Y. Kwon, J. S. Chung
Interspeech
PDF
Spell my name: Keyword boosted speech recognition
N. Jung, G. Kim, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
Multi-scale speaker embedding-based graph attention networks for speaker diarisation
Y. Kwon, H. Heo, J. Jung, Y. J. Kim, B. Lee, J. S. Chung
International Conference on Acoustics, Speech, and Signal Processing
PDF
AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks
J. Jung, H. Heo, H. Tak, H. Shim, J. S. Chung, B. Lee, H. Yu, N. Evans
International Conference on Acoustics, Speech, and Signal Processing
PDF