Joon Son Chung
정준선
Associate Professor
School of Electrical Engineering, KAIST

Contact

  Room 3102, N24 (LG Innovation Hall)
  7470

Professional Experience

Korea Advanced Institute of Science and Technology, Republic of Korea

2021 - current

Associate Professor / Assistant Professor

  • Directing research in speech processing, computer vision and machine learning.

Naver Corporation, Republic of Korea

2018 - 2021

Team Lead / Research Scientist

  • Managed the development of the speech recognition model for the Clova Note application.
  • Developed the speaker recognition system for the LINE AI speaker.

Education

University of Oxford, United Kingdom

2014 - 2018

D.Phil. in Engineering Science

Thesis: "Visual recognition of human communication"

Advisor: Andrew Zisserman

University of Oxford, United Kingdom

2010 - 2014

M.Eng. (B.A.) in Engineering, Economics and Management

First Class Honours

Selected Awards

  • Best Paper Finalist, IEEE SLT (2021)
  • Best Student Paper Award, Interspeech (2020)
  • Winner, ActivityNet Challenge Task B (2019)
  • Best Student Paper Award, Interspeech (2017)
  • Best Student Paper Award, Asian Conference on Computer Vision (2016)

Publications

2024

  • VoxSim: A perceptual voice similarity dataset
    J. Ahn, Y. Kim, Y. Choi, D. Kwak, J. Kim, S. Mun, J. S. Chung
    Interspeech
  • Lightweight Audio Segmentation for Long-form Speech Translation
    J. Lee, S. Kim, H. Kim, J. S. Chung
    Interspeech
  • To what extent can ASV systems naturally defend against spoofing attacks?
    J. Jung, X. Wang, N. Evans, S. Watanabe, H. Shim, H. Tak, S. Arora, J. Yamagishi, J. S. Chung
    Interspeech
  • Disentangled Representation Learning for Environment-agnostic Speaker Recognition
    K. Nam, H. Heo, J. Jung, J. S. Chung
    Interspeech
  • FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching
    C. Jung, S. Lee, J. Kim, J. S. Chung
    Interspeech
  • EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
    J. Kim, H. Lee, K. Rho, J. Kim, J. S. Chung
    International Conference on Machine Learning
  • Faces that Speak: Jointly Synthesising Talking Face and Speech from Text
    Y. Jang, J. Kim, J. Ahn, D. Kwak, H. Yang, Y. Ju, I. Kim, B. Kim, J. S. Chung
    IEEE Conference on Computer Vision and Pattern Recognition
  • Scaling Up Video Summarization Pretraining with Large Language Models
    D. M. Argaw, S. Yoon, F. C. Heilbron, H. Deilamsalehy, T. Bui, Z. Wang, F. Dernoncourt, J. S. Chung
    IEEE Conference on Computer Vision and Pattern Recognition
  • Towards Automated Movie Trailer Generation
    D. M. Argaw, M. Soldan, A. Pardo, C. Zhao, F. C. Heilbron, J. S. Chung, B. Ghanem
    IEEE Conference on Computer Vision and Pattern Recognition
  • FreGrad: Lightweight and fast frequency-aware diffusion vocoder
    T. D. Nguyen, J. Kim, Y. Jang, J. Kim, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • SlowFast Network for Continuous Sign Language Recognition
    J. Ahn, Y. Jang, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification
    H. Heo, K. Nam, B. Lee, Y. Kwon, M. Lee, Y. J. Kim, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Speech Guided Masked Image Modeling for Visually Grounded Speech
    J. Woo, H. Ryu, A. Senocak, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • VoxMM: Rich Transcription of Conversations in the Wild
    D. Kwak, J. Jung, K. Nam, Y. Jang, J. Jung, S. Watanabe, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • From Coarse To Fine: Efficient Training for Audio Spectrogram Transformers
    J. Feng, M. H. Erol, J. S. Chung, A. Senocak
    International Conference on Acoustics, Speech, and Signal Processing
  • VoiceLDM: Text-to-Audio Generation with Linguistic Content
    Y. Lee, I. Yeon, J. Nam, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • TalkNCE: Improving Active Speaker Detection with Talking-Aware Contrastive Learning
    C. Jung, S. Lee, K. Nam, K. Rho, Y. J. Kim, Y. Jang, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model
    S. Lee, C. Jung, Y. Jang, J. Kim, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
    J. Kim, J. Kim, J. S. Chung
    AAAI Conference on Artificial Intelligence

2023

  • That's What I Said: Fully-Controllable Talking Face Generation
    Y. Jang, K. Rho, J. Woo, H. Lee, J. Park, Y. Lim, B. Kim, J. S. Chung
    ACM International Conference on Multimedia
  • Sound Source Localization is All about Cross-Modal Alignment
    A. Senocak, H. Ryu, J. Kim, T. Oh, H. Pfister, J. S. Chung
    International Conference on Computer Vision
  • Disentangled Representation Learning for Multilingual Speaker Recognition
    K. Nam, Y. Kim, J. Huh, H. Heo, J. Jung, J. S. Chung
    Interspeech
  • Curriculum learning for self-supervised speaker verification
    H. Heo, J. Jung, J. Kang, Y. Kwon, B. Lee, Y. J. Kim, J. S. Chung
    Interspeech
  • Self-sufficient framework for continuous sign language recognition
    Y. Jang, Y. Oh, J. W. Cho, M. Kim, D. Kim, I. S. Kweon, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Metric learning for user-defined keyword spotting
    J. Jung, Y. Kim, J. Park, Y. Lim, B. Kim, Y. Jang, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Hindi as a second language: improving visually grounded speech with semantically similar samples
    H. Ryu, A. Senocak, I. S. Kweon, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • MarginNCE: Robust Sound Localization with a Negative Margin
    S. Park, A. Senocak, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Advancing the dimensionality reduction of speaker embeddings for speaker diarisation: disentangling noise and informing speech activity
    Y. J. Kim, H. Heo, J. Jung, Y. Kwon, B. Lee, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • In search of strong embedding extractors for speaker diarisation
    J. Jung, B. Lee, J. Huh, A. Brown, Y. Kwon, S. Watanabe, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech
    J. Lee, J. S. Chung, S. Chung
    International Conference on Acoustics, Speech, and Signal Processing

2022

  • Signing Outside the Studio: Benchmarking Background Robustness for Continuous Sign Language Recognition
    Y. Jang, Y. Oh, J. W. Cho, D. Kim, J. S. Chung, I. S. Kweon
    British Machine Vision Conference
  • Augmentation adversarial training for self-supervised speaker representation learning
    J. Kang, J. Huh, H. Heo, J. S. Chung
    Journal of Selected Topics in Signal Processing
  • Pushing the limits of raw waveform speaker recognition
    J. Jung, Y. J. Kim, H. Heo, B. Lee, Y. Kwon, J. S. Chung
    Interspeech
  • Spell my name: Keyword boosted speech recognition
    N. Jung, G. Kim, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Multi-scale speaker embedding-based graph attention networks for speaker diarisation
    Y. Kwon, H. Heo, J. Jung, Y. J. Kim, B. Lee, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks
    J. Jung, H. Heo, H. Tak, H. Shim, J. S. Chung, B. Lee, H. Yu, N. Evans
    International Conference on Acoustics, Speech, and Signal Processing

2021

  • Adapting Speaker Embeddings for Speaker Diarization
    Y. Kwon, J. Jung, H. Heo, Y. J. Kim, B. Lee, J. S. Chung
    Interspeech
  • Three-class Overlapped Speech Detection using a Convolutional Recurrent Neural Network
    J. Jung, H. Heo, Y. Kwon, J. S. Chung, B. Lee
    Interspeech
  • Look Who's Talking: Active Speaker Detection in the Wild
    Y. J. Kim, H. Heo, S. Choe, S. Chung, Y. Kwon, B. Lee, Y. Kwon, J. S. Chung
    Interspeech
  • Playing a Part: Speaker Verification at the Movies
    A. Brown, J. Huh, A. Nagrani, J. S. Chung, A. Zisserman
    International Conference on Acoustics, Speech, and Signal Processing
  • The ins and outs of speaker recognition: lessons from VoxSRC 2020
    Y. Kwon, H. Heo, B. Lee, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Graph Attention Networks for Speaker Verification
    J. Jung, H. Heo, H. Yu, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Look who's not talking
    Y. Kwon, H. Heo, J. Huh, B. Lee, J. S. Chung
    IEEE Spoken Language Technology Workshop
  • Metric Learning for Keyword Spotting
    J. Huh, M. Lee, H. Heo, S. Mun, J. S. Chung
    IEEE Spoken Language Technology Workshop
  • Cross attentive pooling for speaker verification
    S. M. Kye, Y. Kwon, J. S. Chung
    IEEE Spoken Language Technology Workshop
  • Supervised attention for speaker recognition
    S. M. Kye, J. S. Chung, H. Kim
    IEEE Spoken Language Technology Workshop

2020

  • Perfect Match: Self-Supervised Embeddings for Cross-modal Retrieval
    S. Chung, J. S. Chung, H. Kang
    Journal of Selected Topics in Signal Processing
  • Augmentation adversarial training for self-supervised speaker recognition
    J. Huh, H. Heo, J. Kang, S. Watanabe, J. S. Chung
    Workshop on Self-Supervised Learning for Speech and Audio Processing, NeurIPS
  • FaceFilter: Audio-visual speech separation using still images
    S. Chung, S. Choe, J. S. Chung, H. Kang
    Interspeech
  • Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision
    S. Chung, H. Kang, J. S. Chung
    Interspeech
  • Spot the conversation: speaker diarisation in the wild
    J. S. Chung, J. Huh, A. Nagrani, T. Afouras, A. Zisserman
    Interspeech
  • Now you’re speaking my language: Visual language identification
    T. Afouras, J. S. Chung, A. Zisserman
    Interspeech
  • In defence of metric learning for speaker recognition
    J. S. Chung, J. Huh, S. Mun, M. Lee, H. Heo, S. Choe, C. Ham, S. Jung, B. Lee, I. Han
    Interspeech
  • Self-supervised learning of audio-visual objects from video
    T. Afouras, A. Owens, J. S. Chung, A. Zisserman
    European Conference on Computer Vision
  • BSL-1K: Scaling up co-articulated sign recognition using mouthing cues
    S. Albanie, G. Varol, L. Momeni, T. Afouras, J. S. Chung, N. Fox, A. Zisserman
    European Conference on Computer Vision
  • Delving into VoxCeleb: environment invariant speaker recognition
    J. S. Chung, J. Huh, S. Mun
    Speaker Odyssey
  • ASR is all you need: Cross-modal distillation for lip reading
    T. Afouras, J. S. Chung, A. Zisserman
    International Conference on Acoustics, Speech, and Signal Processing
  • Disentangled Speech Embeddings using Cross-Modal Self-Supervision
    A. Nagrani, J. S. Chung, S. Albanie, A. Zisserman
    International Conference on Acoustics, Speech, and Signal Processing
  • The sound of my voice: speaker representation loss for target voice separation
    S. Mun, S. Choe, J. Huh, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing

2019

  • Deep Audio-Visual Speech Recognition
    T. Afouras, J. S. Chung, A. Senior, O. Vinyals, A. Zisserman
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • You said that?: Synthesising talking faces from audio
    A. Jamaludin, J. S. Chung, A. Zisserman
    International Journal of Computer Vision
  • VoxCeleb: Large-scale speaker verification in the wild
    A. Nagrani, J. S. Chung, W. Xie, A. Zisserman
    Computer Speech and Language
  • Who said that?: Audio-visual speaker diarisation of real-world meetings
    J. S. Chung, B. Lee, I. Han
    Interspeech
  • My lips are concealed: Audio-visual speech enhancement through obstructions
    T. Afouras, J. S. Chung, A. Zisserman
    Interspeech
  • Naver at ActivityNet Challenge 2019--Task B Active Speaker Detection (AVA)
    J. S. Chung
    International Challenge on Activity Recognition
  • Utterance-level Aggregation For Speaker Recognition In The Wild
    W. Xie, A. Nagrani, J. S. Chung, A. Zisserman
    International Conference on Acoustics, Speech, and Signal Processing
  • Perfect match: Improved cross-modal embeddings for audio-visual synchronisation
    S. Chung, J. S. Chung, H. Kang
    International Conference on Acoustics, Speech, and Signal Processing

2018

  • Learning to Lip Read Words by Watching Videos
    J. S. Chung, A. Zisserman
    Computer Vision and Image Understanding
  • VoxCeleb2: Deep Speaker Recognition
    J. S. Chung, A. Nagrani, A. Zisserman
    Interspeech
  • The Conversation: Deep Audio-Visual Speech Enhancement
    T. Afouras, J. S. Chung, A. Zisserman
    Interspeech
  • Deep Lip Reading: a comparison of models and an online application
    T. Afouras, J. S. Chung, A. Zisserman
    Interspeech

2017

  • VoxCeleb: a large-scale speaker identification dataset
    A. Nagrani, J. S. Chung, A. Zisserman
    Interspeech
  • You said that?
    J. S. Chung, A. Jamaludin, A. Zisserman
    British Machine Vision Conference
  • Lip Reading in Profile
    J. S. Chung, A. Zisserman
    British Machine Vision Conference
  • Lip Reading Sentences in the Wild
    J. S. Chung, A. Senior, O. Vinyals, A. Zisserman
    IEEE Conference on Computer Vision and Pattern Recognition

2016

  • Out of time: automated lip sync in the wild
    J. S. Chung, A. Zisserman
    Workshop on Multi-view Lip-reading, ACCV
  • Lip Reading in the Wild
    J. S. Chung, A. Zisserman
    Asian Conference on Computer Vision
  • Signs in time: Encoding human motion as a temporal image
    J. S. Chung, A. Zisserman
    Workshop on Brave New Ideas for Motion Representations, ECCV
