Joon Son Chung
정준선
Associate Professor
School of Electrical Engineering, KAIST

Contact

  Room 3102, N24 (LG Innovation Hall)
  7470

Professional Experience

Korea Advanced Institute of Science and Technology, Republic of Korea

2021 - current

Associate Professor / Assistant Professor

  • Directing research in speech processing, computer vision and machine learning.

Naver Corporation, Republic of Korea

2018 - 2021

Team Lead / Research Scientist

  • Managed the development of the speech recognition model for the Clova Note application.
  • Developed the speaker recognition system for the LINE AI speaker.

Education

University of Oxford, United Kingdom

2014 - 2018

D.Phil. in Engineering Science

Thesis: "Visual recognition of human communication"

Advisor: Andrew Zisserman

University of Oxford, United Kingdom

2010 - 2014

M.Eng. (B.A.) in Engineering, Economics and Management

First Class Honours

Selected Awards

  • Best Paper Finalist, IEEE SLT (2021)
  • Best Student Paper Award, Interspeech (2020)
  • Winner, ActivityNet Challenge Task B (2019)
  • Best Student Paper Award, Interspeech (2017)
  • Best Student Paper Award, Asian Conference on Computer Vision (2016)

Publications

2024

  • VoxSim: A perceptual voice similarity dataset
    J. Ahn, Y. Kim, Y. Choi, D. Kwak, J. Kim, S. Mun, J. S. Chung
    Interspeech
  • Lightweight Audio Segmentation for Long-form Speech Translation
    J. Lee, S. Kim, H. Kim, J. S. Chung
    Interspeech
  • To what extent can ASV systems naturally defend against spoofing attacks?
    J. Jung, X. Wang, N. Evans, S. Watanabe, H. Shim, H. Tak, S. Arora, J. Yamagishi, J. S. Chung
    Interspeech
  • Disentangled Representation Learning for Environment-agnostic Speaker Recognition
    K. Nam, H. Heo, J. Jung, J. S. Chung
    Interspeech
  • FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching
    C. Jung, S. Lee, J. Kim, J. S. Chung
    Interspeech
  • EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
    J. Kim, H. Lee, K. Rho, J. Kim, J. S. Chung
    International Conference on Machine Learning
  • Faces that Speak: Jointly Synthesising Talking Face and Speech from Text
    Y. Jang, J. Kim, J. Ahn, D. Kwak, H. Yang, Y. Ju, I. Kim, B. Kim, J. S. Chung
    IEEE Conference on Computer Vision and Pattern Recognition
  • Scaling Up Video Summarization Pretraining with Large Language Models
    D. M. Argaw, S. Yoon, F. C. Heilbron, H. Deilamsalehy, T. Bui, Z. Wang, F. Dernoncourt, J. S. Chung
    IEEE Conference on Computer Vision and Pattern Recognition
  • Towards Automated Movie Trailer Generation
    D. M. Argaw, M. Soldan, A. Pardo, C. Zhao, F. C. Heilbron, J. S. Chung, B. Ghanem
    IEEE Conference on Computer Vision and Pattern Recognition
  • FreGrad: Lightweight and fast frequency-aware diffusion vocoder
    T. D. Nguyen, J. Kim, Y. Jang, J. Kim, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • SlowFast Network for Continuous Sign Language Recognition
    J. Ahn, Y. Jang, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification
    H. Heo, K. Nam, B. Lee, Y. Kwon, M. Lee, Y. J. Kim, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Speech Guided Masked Image Modeling for Visually Grounded Speech
    J. Woo, H. Ryu, A. Senocak, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • VoxMM: Rich Transcription of Conversations in the Wild
    D. Kwak, J. Jung, K. Nam, Y. Jang, J. Jung, S. Watanabe, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • From Coarse To Fine: Efficient Training for Audio Spectrogram Transformers
    J. Feng, M. H. Erol, J. S. Chung, A. Senocak
    International Conference on Acoustics, Speech, and Signal Processing
  • VoiceLDM: Text-to-Audio Generation with Linguistic Content
    Y. Lee, I. Yeon, J. Nam, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • TalkNCE: Improving Active Speaker Detection with Talking-Aware Contrastive Learning
    C. Jung, S. Lee, K. Nam, K. Rho, Y. J. Kim, Y. Jang, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model
    S. Lee, C. Jung, Y. Jang, J. Kim, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Let There Be Sound: Reconstructing High Quality Speech from Silent Videos
    J. Kim, J. Kim, J. S. Chung
    AAAI Conference on Artificial Intelligence

2023

  • That's What I Said: Fully-Controllable Talking Face Generation
    Y. Jang, K. Rho, J. Woo, H. Lee, J. Park, Y. Lim, B. Kim, J. S. Chung
    ACM International Conference on Multimedia
  • Sound Source Localization is All about Cross-Modal Alignment
    A. Senocak, H. Ryu, J. Kim, T. Oh, H. Pfister, J. S. Chung
    International Conference on Computer Vision
  • Disentangled Representation Learning for Multilingual Speaker Recognition
    K. Nam, Y. Kim, J. Huh, H. Heo, J. Jung, J. S. Chung
    Interspeech
  • Curriculum learning for self-supervised speaker verification
    H. Heo, J. Jung, J. Kang, Y. Kwon, B. Lee, Y. J. Kim, J. S. Chung
    Interspeech
  • Self-sufficient framework for continuous sign language recognition
    Y. Jang, Y. Oh, J. W. Cho, M. Kim, D. Kim, I. S. Kweon, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Metric learning for user-defined keyword spotting
    J. Jung, Y. Kim, J. Park, Y. Lim, B. Kim, Y. Jang, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Hindi as a second language: improving visually grounded speech with semantically similar samples
    H. Ryu, A. Senocak, I. S. Kweon, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • MarginNCE: Robust Sound Localization with a Negative Margin
    S. Park, A. Senocak, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Advancing the dimensionality reduction of speaker embeddings for speaker diarisation: disentangling noise and informing speech activity
    Y. J. Kim, H. Heo, J. Jung, Y. Kwon, B. Lee, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • In search of strong embedding extractors for speaker diarisation
    J. Jung, B. Lee, J. Huh, A. Brown, Y. Kwon, S. Watanabe, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech
    J. Lee, J. S. Chung, S. Chung
    International Conference on Acoustics, Speech, and Signal Processing

2022

  • Signing Outside the Studio: Benchmarking Background Robustness for Continuous Sign Language Recognition
    Y. Jang, Y. Oh, J. W. Cho, D. Kim, J. S. Chung, I. S. Kweon
    British Machine Vision Conference
  • Augmentation adversarial training for self-supervised speaker representation learning
    J. Kang, J. Huh, H. Heo, J. S. Chung
    Journal of Selected Topics in Signal Processing
  • Pushing the limits of raw waveform speaker recognition
    J. Jung, Y. J. Kim, H. Heo, B. Lee, Y. Kwon, J. S. Chung
    Interspeech
  • Spell my name: Keyword boosted speech recognition
    N. Jung, G. Kim, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Multi-scale speaker embedding-based graph attention networks for speaker diarisation
    Y. Kwon, H. Heo, J. Jung, Y. J. Kim, B. Lee, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks
    J. Jung, H. Heo, H. Tak, H. Shim, J. S. Chung, B. Lee, H. Yu, N. Evans
    International Conference on Acoustics, Speech, and Signal Processing

2021

  • Adapting Speaker Embeddings for Speaker Diarization
    Y. Kwon, J. Jung, H. Heo, Y. J. Kim, B. Lee, J. S. Chung
    Interspeech
  • Three-class Overlapped Speech Detection using a Convolutional Recurrent Neural Network
    J. Jung, H. Heo, Y. Kwon, J. S. Chung, B. Lee
    Interspeech
  • Look Who's Talking: Active Speaker Detection in the Wild
    Y. J. Kim, H. Heo, S. Choe, S. Chung, Y. Kwon, B. Lee, Y. Kwon, J. S. Chung
    Interspeech
  • Playing a Part: Speaker Verification at the Movies
    A. Brown, J. Huh, A. Nagrani, J. S. Chung, A. Zisserman
    International Conference on Acoustics, Speech, and Signal Processing
  • The ins and outs of speaker recognition: lessons from VoxSRC 2020
    Y. Kwon, H. Heo, B. Lee, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Graph Attention Networks for Speaker Verification
    J. Jung, H. Heo, H. Yu, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing
  • Look who's not talking
    Y. Kwon, H. Heo, J. Huh, B. Lee, J. S. Chung
    IEEE Spoken Language Technology Workshop
  • Metric Learning for Keyword Spotting
    J. Huh, M. Lee, H. Heo, S. Mun, J. S. Chung
    IEEE Spoken Language Technology Workshop
  • Cross attentive pooling for speaker verification
    S. M. Kye, Y. Kwon, J. S. Chung
    IEEE Spoken Language Technology Workshop
  • Supervised attention for speaker recognition
    S. M. Kye, J. S. Chung, H. Kim
    IEEE Spoken Language Technology Workshop

2020

  • Perfect Match: Self-Supervised Embeddings for Cross-modal Retrieval
    S. Chung, J. S. Chung, H. Kang
    Journal of Selected Topics in Signal Processing
  • Augmentation adversarial training for self-supervised speaker recognition
    J. Huh, H. Heo, J. Kang, S. Watanabe, J. S. Chung
    Workshop on Self-Supervised Learning for Speech and Audio Processing, NeurIPS
  • FaceFilter: Audio-visual speech separation using still images
    S. Chung, S. Choe, J. S. Chung, H. Kang
    Interspeech
  • Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision
    S. Chung, H. Kang, J. S. Chung
    Interspeech
  • Spot the conversation: speaker diarisation in the wild
    J. S. Chung, J. Huh, A. Nagrani, T. Afouras, A. Zisserman
    Interspeech
  • Now you’re speaking my language: Visual language identification
    T. Afouras, J. S. Chung, A. Zisserman
    Interspeech
  • In defence of metric learning for speaker recognition
    J. S. Chung, J. Huh, S. Mun, M. Lee, H. Heo, S. Choe, C. Ham, S. Jung, B. Lee, I. Han
    Interspeech
  • Self-supervised learning of audio-visual objects from video
    T. Afouras, A. Owens, J. S. Chung, A. Zisserman
    European Conference on Computer Vision
  • BSL-1K: Scaling up co-articulated sign recognition using mouthing cues
    S. Albanie, G. Varol, L. Momeni, T. Afouras, J. S. Chung, N. Fox, A. Zisserman
    European Conference on Computer Vision
  • Delving into VoxCeleb: environment invariant speaker recognition
    J. S. Chung, J. Huh, S. Mun
    Speaker Odyssey
  • ASR is all you need: Cross-modal distillation for lip reading
    T. Afouras, J. S. Chung, A. Zisserman
    International Conference on Acoustics, Speech, and Signal Processing
  • Disentangled Speech Embeddings using Cross-Modal Self-Supervision
    A. Nagrani, J. S. Chung, S. Albanie, A. Zisserman
    International Conference on Acoustics, Speech, and Signal Processing
  • The sound of my voice: speaker representation loss for target voice separation
    S. Mun, S. Choe, J. Huh, J. S. Chung
    International Conference on Acoustics, Speech, and Signal Processing

2019

  • Deep Audio-Visual Speech Recognition
    T. Afouras, J. S. Chung, A. Senior, O. Vinyals, A. Zisserman
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • You said that?: Synthesising talking faces from audio
    A. Jamaludin, J. S. Chung, A. Zisserman
    International Journal of Computer Vision
  • VoxCeleb: Large-scale speaker verification in the wild
    A. Nagrani, J. S. Chung, W. Xie, A. Zisserman
    Computer Speech and Language
  • Who said that?: Audio-visual speaker diarisation of real-world meetings
    J. S. Chung, B. Lee, I. Han
    Interspeech
  • My lips are concealed: Audio-visual speech enhancement through obstructions
    T. Afouras, J. S. Chung, A. Zisserman
    Interspeech
  • Naver at ActivityNet Challenge 2019--Task B Active Speaker Detection (AVA)
    J. S. Chung
    International Challenge on Activity Recognition
  • Utterance-level Aggregation For Speaker Recognition In The Wild
    W. Xie, A. Nagrani, J. S. Chung, A. Zisserman
    International Conference on Acoustics, Speech, and Signal Processing
  • Perfect match: Improved cross-modal embeddings for audio-visual synchronisation
    S. Chung, J. S. Chung, H. Kang
    International Conference on Acoustics, Speech, and Signal Processing

2018

  • Learning to Lip Read Words by Watching Videos
    J. S. Chung, A. Zisserman
    Computer Vision and Image Understanding
  • VoxCeleb2: Deep Speaker Recognition
    J. S. Chung, A. Nagrani, A. Zisserman
    Interspeech
  • The Conversation: Deep Audio-Visual Speech Enhancement
    T. Afouras, J. S. Chung, A. Zisserman
    Interspeech
  • Deep Lip Reading: a comparison of models and an online application
    T. Afouras, J. S. Chung, A. Zisserman
    Interspeech

2017

  • VoxCeleb: a large-scale speaker identification dataset
    A. Nagrani, J. S. Chung, A. Zisserman
    Interspeech
  • You said that?
    J. S. Chung, A. Jamaludin, A. Zisserman
    British Machine Vision Conference
  • Lip Reading in Profile
    J. S. Chung, A. Zisserman
    British Machine Vision Conference
  • Lip Reading Sentences in the Wild
    J. S. Chung, A. Senior, O. Vinyals, A. Zisserman
    IEEE Conference on Computer Vision and Pattern Recognition

2016

  • Out of time: automated lip sync in the wild
    J. S. Chung, A. Zisserman
    Workshop on Multi-view Lip-reading, ACCV
  • Lip Reading in the Wild
    J. S. Chung, A. Zisserman
    Asian Conference on Computer Vision
  • Signs in time: Encoding human motion as a temporal image
    J. S. Chung, A. Zisserman
    Workshop on Brave New Ideas for Motion Representations, ECCV
