
The VoxSRC Workshop 2022

Welcome to the VoxSRC Workshop 2022! The workshop includes presentations from the most exciting and novel submissions to the VoxCeleb Speaker Recognition Challenge (VoxSRC), as well as the announcement of the challenge winners.

The workshop was held in conjunction with Interspeech 2022.

VoxSRC 2022 was a hybrid workshop, with both in-person and virtual attendance options. It took place at 5pm KST on Thursday 22nd September 2022 (8am UTC), at Room 110 of Incheon Convensia. Please find our workshop report below.

Information about all editions of this workshop series can be found on this website.

VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge
Jaesung Huh, Andrew Brown, Jee-weon Jung, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman
arXiv, 2023.

Schedule

The workshop was held from 5:00pm to 8:00pm Korea Standard Time (KST).

5:00pm Introduction of dataset and challenges [slides] [video]
5:25pm Keynote speech: Junichi Yamagishi, "The use of speaker embeddings in neural audio generation" [slides] [video]
6:15pm 10 min break
6:25pm Announcement of Winners (Tracks 1 and 2)
6:30pm Invited Talks from Tracks 1 and 2 [video]
7:10pm Announcement of Winners (Track 3)
7:12pm Invited Talks from Track 3 [video]
7:30pm Announcement of Winners (Track 4)
7:35pm Invited Talks from Track 4 [video]
7:55pm Wrap up discussion and conclusion

Introduction


Participant talks


Tracks 1 and 2

Team ravana - ID R&D [slides]
Team KristonAI [slides]
Team SJTU-AIspeech [slides]
Team Strasbourg_spk [slides]

Track 3

Team zzdddz [slides]
Team DKU-Tencent [slides]

Track 4

Team DKU-DukeECE [slides]
Team KristonAI [slides]
Team AiTER [slides]

Technical reports


Team                          Tracks    Report
ravana - ID R&D               1, 2      PDF
KristonAI                     1, 2, 4   PDF
SJTU-AIspeech                 1, 3      arXiv
Strasbourg_spk                2         arXiv
NSYSU-CHT                     1, 2, 3   PDF
ReturnZero                    1         arXiv
zzdddz                        1, 3      PDF
DKU-Tencent                   1, 3      PDF
Royalflush                    1, 3      arXiv
DKU-DukeECE                   4         PDF
AiTER                         4         arXiv
Pyannote                      4         PDF
BUCEA                         4         arXiv
HYU                           3, 4      PDF
Newsbridge-Telecom SudParis   4         PDF

Workshop Registration

Registration for this workshop is now closed.

Keynote Speaker



Junichi Yamagishi

Title

The use of speaker embeddings in neural audio generation

Abstract

Neural speaker embedding vectors are becoming an essential technology not only in speaker recognition but also in speech synthesis. In this talk, I will first outline how speaker embedding vectors are used in voice conversion, where one speaker's voice is converted into another speaker's voice, and in multi-speaker text-to-speech (TTS) systems, where a single model can synthesize natural-sounding voices of multiple speakers from input sentences. Then I will explain how the performance of speaker vectors in the speaker recognition task relates to the speaker similarity of the synthesized voices. The latest performance of voice conversion systems will also be presented, based on the results of the Voice Conversion Challenge 2020.
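As a concrete illustration of the link between the two tasks: speaker verification systems typically score a trial by the cosine similarity between two embedding vectors, and the same score can serve as a proxy for how closely a synthesized voice matches its target speaker. Below is a minimal NumPy sketch of this scoring; the embedding dimension, the synthetic vectors, and the decision threshold are illustrative placeholders, not values from the talk.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder 192-dim embeddings (a common ECAPA-TDNN output size).
# In practice these come from a trained speaker encoder applied to an
# enrollment utterance and a test (or synthesized) utterance.
rng = np.random.default_rng(0)
enrolled = rng.standard_normal(192)
test = enrolled + 0.3 * rng.standard_normal(192)  # same-speaker-like trial

score = cosine_similarity(enrolled, test)
# Real systems calibrate the threshold on held-out trials; 0.5 is arbitrary.
print("accept" if score > 0.5 else "reject", f"(score={score:.3f})")
```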

I will then introduce "speaker anonymization" as a new example of the use of speaker embeddings in the field of speech privacy. Speaker anonymization aims to convert only the speaker characteristics of the input speech, so that automatic speaker verification (ASV) systems cannot identify the original speaker, while preserving the usefulness of the anonymized audio for the downstream tasks the user wishes to perform. As an example of such speaker anonymization using speaker embedding vectors, I will present a language-independent speaker anonymization system built on ECAPA-TDNN, HuBERT, and HiFi-GAN, and show its excellent results under the VoicePrivacy Challenge metrics.
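One common way to realize the anonymizing pseudo-speaker in such systems (used, for example, in the VoicePrivacy Challenge baselines) is to replace the source speaker's embedding with an average over embeddings of distant speakers from an external pool. The sketch below shows only that selection step in NumPy; the pool, the cosine distance, and the pool size k are illustrative assumptions, and the HuBERT content extraction and HiFi-GAN resynthesis around it are omitted.

```python
import numpy as np

def pseudo_speaker_embedding(source: np.ndarray,
                             pool: np.ndarray,
                             k: int = 10) -> np.ndarray:
    """Anonymized embedding: mean of the k pool embeddings farthest
    from the source under cosine similarity (an illustrative choice)."""
    # Length-normalize so cosine similarity reduces to a dot product.
    src = source / np.linalg.norm(source)
    pool_n = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    similarity = pool_n @ src              # cosine similarity to the source
    farthest = np.argsort(similarity)[:k]  # indices of the k least similar
    return pool_n[farthest].mean(axis=0)

# Placeholder pool of 200 speaker embeddings (192-dim).
rng = np.random.default_rng(0)
pool = rng.standard_normal((200, 192))
source = rng.standard_normal(192)

anon = pseudo_speaker_embedding(source, pool)
# In a full system, `anon` would condition the vocoder (e.g. HiFi-GAN)
# together with content features (e.g. from HuBERT) to resynthesize speech.
```

Averaging several distant speakers, rather than picking a single one, makes the resulting voice unlike the source while not impersonating any real speaker in the pool.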

Biography

Junichi Yamagishi (Senior Member, IEEE) received a Ph.D. from the Tokyo Institute of Technology, Tokyo, Japan, in 2006. From 2007 to 2013, he was a Research Fellow with the Centre for Speech Technology Research, The University of Edinburgh, Edinburgh, U.K. In 2013, he joined the National Institute of Informatics, Tokyo, Japan, as an Associate Professor, where he is currently a Professor. His research interests include speech processing, machine learning, signal processing, biometrics, digital media cloning, and media forensics.

He is a co-organizer of the biennial ASVspoof Challenge and the biennial Voice Conversion Challenge. He was a member of the IEEE Speech and Language Processing Technical Committee during 2013-2019, an Associate Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing during 2014-2017, and chairperson of ISCA SynSIG during 2017-2021. He is currently a PI of the JST-CREST and ANR-supported VoicePersona project and a Senior Area Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing.


Organisers

Jaesung Huh, VGG, University of Oxford
Andrew Brown, VGG, University of Oxford
Arsha Nagrani, Google Research
Joon Son Chung, KAIST, South Korea
Jee-weon Jung, Naver, South Korea
Andrew Zisserman, VGG, University of Oxford
Daniel Garcia-Romero, AWS AI

Advisors

Mitchell McLaren, Speech Technology and Research Laboratory, SRI International, CA
Douglas A. Reynolds, Lincoln Laboratory, MIT

Please contact jaesung[at]robots[dot]ox[dot]ac[dot]uk or abrown[at]robots[dot]ox[dot]ac[dot]uk if you have any queries, or if you would be interested in sponsoring this challenge.

Sponsors

VoxSRC is proudly sponsored by Naver/Line.

Acknowledgements

This work is supported by the EPSRC (Engineering and Physical Sciences Research Council) programme grant EP/T028572/1: Visual AI.