Welcome to the VoxSRC Workshop 2023! This workshop showcased presentations from the most exciting and innovative submissions to the VoxCeleb Speaker Recognition Challenge (VoxSRC). We also announced the challenge winners of all four tracks.
The workshop was held in conjunction with Interspeech 2023.
VoxSRC 2023 was a hybrid workshop, with both in-person and virtual attendance options.
Date / Time : 5:30pm Ireland time (GMT+1) on Sunday 20th August 2023.
Venue : Pearse Suite, Clayton Hotel Cardiff Lane, Sir John Rogerson's Quay, Dublin 2.
The workshop ran from 5:30pm to 8:30pm Ireland time (GMT+1). These slides were used during the workshop.
|5:30pm||Introduction of VoxSRC challenges and Reflection|
|6:00pm||Keynote speech : Wei-Ning Hsu "Scalable Controllable Speech Generation via Explicitly and Implicitly Disentangled Speech Representations"|
|7:00pm||Announcement of Winners (Track 1, 2)|
|7:05pm||Invited Talks from Track 1 and 2|
|7:30pm||Announcement of Winners (Track 3)|
|7:32pm||Invited Talks from Track 3|
|7:50pm||Announcement of Winners (Track 4)|
|7:55pm||Invited Talks from Track 4|
|8:20pm||Wrap up discussion and conclusion|
Registration for this workshop is now closed.
Scalable Controllable Speech Generation via Explicitly and Implicitly Disentangled Speech Representations
Speech analysis and speech generation are inverse processes of each other. The goal of speech analysis is to tease apart speech to infer the latent generating factors, while the goal of speech generation is to create new audio given those factors. Building a controllable speech generation model in which attributes can be independently manipulated often requires training the model with disentangled representations (e.g., transcript, speaker embedding, normalized F0) as input, so that at inference time each attribute can be controlled via the corresponding input stream. The speaker embedding and the speech transcript are considered the two most critical representations, together capturing the majority of speech variation. We refer to this as the “explicitly disentangled” paradigm and introduce a few studies from this camp, including Unit-HiFiGAN and ReVISE.
While the “explicitly disentangled” paradigm works reasonably well on small-scale curated datasets, it often struggles when developing models on in-the-wild datasets, which contain variation unspecified by the input streams, such as emotion and background noise. To tackle that, we introduce Voicebox, an “implicitly disentangled” paradigm for controllable speech generation. It learns to infer non-textual attributes from an audio context in order to control various “audio style” attributes for generation, which include not only voice, but also emotion, quality, and environment. This paradigm enables speech generative models to scale successfully in model size and training data (>50K hours). Finally, we conclude the talk with a discussion of future research on leveraging both explicit and implicit representations for better disentangled control, and on the potential of learning shared text/audio embeddings for audio style.
Wei-Ning Hsu is a research scientist at Meta Foundational AI Research (FAIR) based in New York, USA. He has 10+ years of research experience in machine learning and speech processing. With 75+ peer-reviewed articles, he's a leading authority in self-supervised learning, generative modeling for speech, and multimodal speech analysis. His pioneering work includes Voicebox, HuBERT, AV-HuBERT, data2vec, Textless NLP, wav2vec-U, MMS, ResDAVENet-VQ, GMVAE-Tacotron, and FHVAE.
Before joining FAIR, Wei-Ning interned at MERL (hosted by Shinji Watanabe, Jonathan Le Roux, John Hershey), Google Brain (hosted by Yu Zhang and Yuxuan Wang), and Facebook (hosted by Awni Hannun and Gabriel Synnaeve). He earned his Ph.D. and S.M. degrees in Electrical Engineering and Computer Science from MIT (supervised by Dr. James Glass), and a B.S. in Electrical Engineering from National Taiwan University (supervised by Prof. Lin-shan Lee and Prof. Hsuan-Tien Lin).
Slides : [link]
Mitchell McLaren, Speech Technology and Research Laboratory, SRI International, CA,
Douglas A Reynolds, Lincoln Laboratory, MIT.