Welcome to the VoxSRC Workshop 2022! The workshop includes presentations from the most exciting and novel submissions to the VoxCeleb Speaker Recognition Challenge (VoxSRC), as well as the announcement of the challenge winners.
The workshop was held in conjunction with Interspeech 2022.
VoxSRC 2022 was a hybrid workshop, with both in-person and virtual attendance options. It took place at 5pm KST on Thursday 22nd September 2022 (8am UTC), at Room 110 of Incheon Convensia. Please find our workshop report below.
Information about all editions of this workshop series can be found on this website.
The workshop was held from 5:00pm to 8:00pm Korea Standard Time (KST).
Time | Session |
5:00pm | Introduction of dataset and challenges [slides] [video] |
5:25pm | Keynote speech: Junichi Yamagishi, "The use of speaker embeddings in neural audio generation" [slides] [video] |
6:15pm | 10-minute break |
6:25pm | Announcement of Winners (Tracks 1 and 2) |
6:30pm | Invited Talks from Tracks 1 and 2 [video] |
7:10pm | Announcement of Winners (Track 3) |
7:12pm | Invited Talks from Track 3 [video] |
7:30pm | Announcement of Winners (Track 4) |
7:35pm | Invited Talks from Track 4 [video] |
7:55pm | Wrap-up discussion and conclusion |
Team | Track | File |
ravana - ID R&D | 1,2 | |
KristonAI | 1,2,4 | |
SJTU-AIspeech | 1,3 | arXiv |
Strasbourg_spk | 2 | arXiv |
NSYSU-CHT | 1,2,3 | |
ReturnZero | 1 | arXiv |
zzdddz | 1,3 | |
DKU-Tencent | 1,3 | |
Royalflush | 1,3 | arXiv |
DKU-DukeECE | 4 | |
AiTER | 4 | arXiv |
Pyannote | 4 | |
BUCEA | 4 | arXiv |
HYU | 3,4 | |
Newsbridge-Telecom SudParis | 4 | |
Registration for this workshop is now closed.
The use of speaker embeddings in neural audio generation
Neural speaker embedding vectors are becoming an essential technology not only in speaker recognition but also in speech synthesis. In this talk, I will first outline how speaker embedding vectors are used in voice conversion, where one speaker's voice is converted to another speaker's voice, and in multi-speaker TTS systems, where multiple speakers’ natural-sounding voices can be synthesized from input sentences by a single model. Then I will explain how the performance of speaker vectors in the speaker recognition task is related to the speaker similarity of the synthesized voices. The latest performance of voice conversion systems will also be presented based on the results of the Voice Conversion Challenge 2020.
I will then introduce "speaker anonymization" as a new example of the use of speaker embeddings in the field of speech privacy. Speaker anonymization aims to convert only the speaker characteristics of the input speech, so that automatic speaker verification (ASV) systems cannot identify the original speaker, while preserving the usefulness of the anonymized audio for the downstream tasks the user wishes to perform. As an example of such speaker anonymization using speaker embedding vectors, I will present a language-independent speaker anonymization system built from ECAPA-TDNN, HuBERT, and HiFi-GAN, and show its excellent evaluation results on the VoicePrivacy Challenge metrics.
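As a rough illustration of the speaker-similarity scoring described in the abstract, the sketch below extracts ECAPA-TDNN speaker embeddings with a publicly available pretrained SpeechBrain model and compares two utterances by cosine similarity. This is not part of the talk materials: the model identifier and the file paths are illustrative assumptions, and the anonymization system presented in the keynote is a separate pipeline.

```python
# Minimal sketch (not from the talk): score speaker similarity between two
# utterances using ECAPA-TDNN embeddings from a pretrained SpeechBrain model.
# The model source string and the .wav paths below are illustrative assumptions.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Publicly released ECAPA-TDNN speaker encoder trained on VoxCeleb.
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed(path: str) -> torch.Tensor:
    """Load a waveform, convert to 16 kHz mono, and return its speaker embedding."""
    waveform, sample_rate = torchaudio.load(path)
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    return encoder.encode_batch(waveform).squeeze()

# Hypothetical files: an original utterance and its converted/anonymized version.
emb_a = embed("original.wav")
emb_b = embed("anonymized.wav")

# Cosine similarity between embeddings; in an anonymization setting, a low score
# suggests an ASV system would not link the two recordings to the same speaker.
score = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0)
print(f"speaker similarity: {score.item():.3f}")
```

The same scoring scheme, applied between synthesized and reference speech, is what connects ASV embedding quality to the perceived speaker similarity of voice conversion and multi-speaker TTS outputs discussed in the talk.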
Junichi Yamagishi (Senior Member, IEEE) received a Ph.D. from the Tokyo Institute of Technology, Tokyo, Japan, in 2006. From 2007 to 2013, he was a Research Fellow with the Centre for Speech Technology Research, The University of Edinburgh, Edinburgh, U.K. In 2013, he became an Associate Professor at the National Institute of Informatics, Tokyo, Japan, where he is currently a Professor. His research interests include speech processing, machine learning, signal processing, biometrics, digital media cloning, and media forensics.
He is a co-organizer of the biennial ASVspoof Challenge and the biennial Voice Conversion Challenge. He was also a member of the IEEE Speech and Language Processing Technical Committee during 2013–2019, an associate editor of the IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING during 2014–2017, and chairperson of ISCA SynSIG during 2017–2021. He is currently a PI of the JST-CREST and ANR-supported VoicePersona Project and a senior area editor of the IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING.
Jaesung Huh, VGG, University of Oxford
Andrew Brown, VGG, University of Oxford
Arsha Nagrani, Google Research
Joon Son Chung, KAIST, South Korea
Jeeweon Jung, Naver, South Korea
Andrew Zisserman, VGG, University of Oxford
Daniel Garcia-Romero, AWS AI
Mitchell McLaren, Speech Technology and Research Laboratory, SRI International, CA
Douglas A. Reynolds, Lincoln Laboratory, MIT
Please contact jaesung[at]robots[dot]ox[dot]ac[dot]uk or abrown[at]robots[dot]ox[dot]ac[dot]uk if you have any queries, or if you would be interested in sponsoring this challenge.