
The VoxSRC Workshop 2023

Welcome to the VoxSRC Workshop 2023! This workshop showcased presentations from the most thrilling and innovative submissions to the VoxCeleb Speaker Recognition Challenge (VoxSRC). We also announced the winners of all four challenge tracks.
The workshop was held in conjunction with Interspeech 2023.

VoxSRC 2023 was a hybrid workshop, with both in-person and virtual attendance options.

Date / Time : 5:30pm Ireland time (GMT+1) on Sunday 20th August 2023.
Venue : Pearse Suite, Clayton Hotel Cardiff Lane, Sir John Rogerson's Quay, Dublin 2.


The workshop was held from 5:30pm to 8:30pm, Ireland time (GMT+1). These slides were used during the workshop.

5:30pm Introduction of VoxSRC challenges and Reflection
6:00pm Keynote speech : Wei-Ning Hsu "Scalable Controllable Speech Generation via Explicitly and Implicitly Disentangled Speech Representations"
6:45pm Coffee break
7:00pm Announcement of Winners (Track 1, 2)
7:05pm Invited Talks from Track 1 and 2
7:30pm Announcement of Winners (Track 3)
7:32pm Invited Talks from Track 3
7:50pm Announcement of Winners (Track 4)
7:55pm Invited Talks from Track 4
8:20pm Wrap up discussion and conclusion

Participant talks

Track 1, 2

Team Unisound [slides]
Team ID R&D [slides]
Team xg0721 [slides]

Track 3

Team DKU-MSXF [slides]

Track 4

Team DKU-MSXF [slides]
Team KrispAI [slides]
Team pyannote [slides]

Technical reports

Team          Track    File
ChinaTelecom  1        PDF
ID R&D        2        arXiv
Unisound      1,2      arXiv
XG0721        1        PDF
DKU-MSXF      1,2,3    arXiv
Chinamobile   1,2      PDF
Intema        2        PDF
Wespeaker     1,2,3,4  arXiv
KrispAI       4        PDF
pyannote      4        PDF
DKU-MSXF      4        arXiv
GIST-AiTER    4        arXiv

Workshop Registration

Registration for this workshop is now closed.

Keynote Speaker


Wei-Ning Hsu


Scalable Controllable Speech Generation via Explicitly and Implicitly Disentangled Speech Representations


Speech analysis and speech generation are inverse processes of each other. The goal of speech analysis is to tease apart speech to infer the latent generating factors, while the goal of speech generation is to create new audio given those factors. Building a controllable speech generation model where attributes can be independently manipulated often requires training a model with disentangled representations (e.g., transcript, speaker embedding, normalized F0) as input, such that during inference time each attribute can be controlled via the corresponding input stream. The speaker embedding and the speech transcript are considered the two most critical representations, which capture the majority of speech variation. We refer to this as the “explicitly disentangled” paradigm and introduce a few studies from this camp, including Unit-HiFiGAN and ReVISE.

While the “explicitly disentangled” paradigm works reasonably well on small-scale curated datasets, it often struggles when developing models on in-the-wild datasets which contain variation unspecified by the input streams, such as emotion and background noise. To tackle that, we introduce Voicebox, an “implicitly disentangled” paradigm for controllable speech generation. It learns to infer non-textual attributes from an audio context in order to control various “audio style” attributes for generation, which includes not only voice, but also emotion, quality, and environment. This paradigm enables speech generative models to scale successfully in model size and training data (>50K hours). Finally, we conclude the talk with discussions of future research on leveraging both explicit and implicit representations for better disentangled control, and topics on the potential of learning shared text/audio embeddings for audio style.


Wei-Ning Hsu is a research scientist at Meta Foundational AI Research (FAIR) based in New York, USA. He has 10+ years of research experience in machine learning and speech processing. With 75+ peer-reviewed articles, he's a leading authority in self-supervised learning, generative modeling for speech, and multimodal speech analysis. His pioneering work includes Voicebox, HuBERT, AV-HuBERT, data2vec, Textless NLP, wav2vec-U, MMS, ResDAVENet-VQ, GMVAE-Tacotron, and FHVAE.

Before joining FAIR, Wei-Ning interned at MERL (hosted by Shinji Watanabe, Jonathan Le Roux, John Hershey), Google Brain (hosted by Yu Zhang and Yuxuan Wang), and Facebook (hosted by Awni Hannun and Gabriel Synnaeve). He earned his Ph.D. and S.M. degrees in Electrical Engineering and Computer Science from MIT (supervised by Dr. James Glass), and a B.S. in Electrical Engineering from National Taiwan University (supervised by Prof. Lin-shan Lee and Prof. Hsuan-Tien Lin).


Slides : [link]


Jaesung Huh, VGG, University of Oxford
Jeeweon Jung, CMU
Andrew Brown, Meta AI
Arsha Nagrani, Google Research
Joon Son Chung, KAIST
Andrew Zisserman, VGG
Daniel Garcia-Romero, AWS AI


Mitchell McLaren, Speech Technology and Research Laboratory, SRI International, CA
Douglas A Reynolds, Lincoln Laboratory, MIT

Please contact jaesung[at]robots[dot]ox[dot]ac[dot]uk or abrown[at]robots[dot]ox[dot]ac[dot]uk if you have any queries, or if you would be interested in sponsoring this challenge.


This work is supported by the EPSRC (Engineering and Physical Sciences Research Council) programme grant EP/T028572/1: Visual AI project.