VoxSRC 2023 challenge

Tracks

VoxSRC 2023 will feature four tracks. All four tracks are identical to previous year. Track 1, 2 and 3 are speaker verification tracks, where the task is to determine whether two samples of speech are from the same person. Track 4 is a speaker diarisation track, where the task is to break up multi-speaker audio into homogenous single speaker segments, effectively solving ‘who spoke when’.

#	Description
Track 1	Fully supervised speaker verification (closed train set) Train set : Participants can only use VoxCeleb2 dev dataset for which we have already released speaker labels. Validation set : Please look at the data section below.
Track 2	Fully supervised speaker verification (open train set) Train set : Participants can use VoxCeleb2 dev dataset and any other data except the challenge test data Validation set : Please look at the data section below.
Track 3	Semi-supervised domain adaptation (closed train set) Train set : Participants may train on (1) a large set of labelled data in a source domain (VoxCeleb2 dev dataset with speaker labels), (2) a large set of unlabelled data in a target domain (a subset of the CnCeleb2 dev set without speaker labels), and (3) a small set of labelled data in a target domain (a small set of CnCeleb data with speaker labels.). Validation set : Please look at the data section below.
Track 4	Speaker diarisation (open train set) Train set : Participants are allowed to use any data except the challenge test data Validation set : We provide both the dev and test set of VoxConverse to use in validation. Please use version 0.3 for the competition.

Data

Train set

Please read the Tracks section carefully and choose the data for training your own models. Below are some hyperlinks that may be helpful to you.

VoxCeleb	CNCeleb	VoxMovies	VoxConverse

Validation set

Verification tracks

We have included some of the utterances from VoxCeleb1, VoxMovies and VoxConverse dataset in validation set. You are allowed to use these datasets for training in Track 2. Validation data for Track 3 is same as last year.

For verification validation set, please download all three files using wget. Then, run zip -F VoxSRC2023_val.zip --out VoxSRC2023_val_total.zip, before unzipping VoxSRC2023_val_total.zip. Specifically, run the commands in your terminal:


            wget https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data_workshop_2023/VoxSRC2023_val.z01 

            wget https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data_workshop_2023/VoxSRC2023_val.z02 
 
            wget https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data_workshop_2023/VoxSRC2023_val.zip 

            zip -F VoxSRC2023_val.zip --out VoxSRC2023_val_total.zip 

            unzip VoxSRC2023_val_total.zip

File		MD5 Checksum
Track 1 & 2 validation wavfiles	See the instruction above.
Track 1 & 2 validation trial pairs	Download	`8c3476802e14682f11ea356954ceab8e`
Track 3 unsupervised target domain data	Download	`b0157d5cb961ecb1f5f617625fb843a1`
Track 3 supervised target domain data	Download	`57170ba6c8c5223be0cefc6ab1b43e5f`
Track 3 validation wavfiles	Download	`50fccc3315cf7b18d6575350d8fb043d`
Track 3 validation trial pairs	Download	`97f71af121f620363f86070089adad02`

Diarisation track

This time we use both dev and test set of VoxConverse as our validation set on diarisation track. Please visit VoxConverse website to download the dataset.

Test set

Verification tracks

For verification validation set, please download all four files using wget. Then, run zip -F VoxSRC2023_test.zip --out VoxSRC2023_test_total.zip, before unzipping VoxSRC2023_test_total.zip. Specifically, run the commands in your terminal:


            wget https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data_workshop_2023/VoxSRC2023_test.z01 

            wget https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data_workshop_2023/VoxSRC2023_test.z02 
 
            wget https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data_workshop_2023/VoxSRC2023_test.z03 

            wget https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data_workshop_2023/VoxSRC2023_test.zip 

            zip -F VoxSRC2023_test.zip --out VoxSRC2023_test_total.zip 

            unzip VoxSRC2023_test_total.zip

File		MD5 Checksum
Track 1 & 2 test wavfiles	See the instruction above.
Track 1 & 2 test trial pairs	Download	`e42df439e75da53fcdfdf0821999ee36`
Track 3 test wavfiles	Download	`05b7bbba05bb80f94ed192b4833e6bda`
Track 3 test trial pairs	Download	`c055d52b3aa043fc1419462af13c55db`

Diarisation track

File		MD5 Checksum
Track 4 test wavfiles	Download	`2c7b562df1eb3b52d39b57e9eb267890`

Evaluation Metrics

Speaker Verification

For the Speaker Verification tracks, we will display both the Equal Error Rate (EER) and the Minimum Detection Cost (CDet). For tracks 1 and 2, the primary metric for the challenge will be the Detection Cost, and the final ranking of the leaderboard will be determined using this score alone. For track 3, the primary metric is EER, as this is a more forgiving metric.

Equal Error Rate

This is the rate used to determine the threshold value for a system when its false acceptance rate (FAR) and false rejection rate (FRR) are equal.

Minimum Detection Cost

Compared to equal error-rate, which assigns equal weight to false negatives and false positives, this error-rate is usually used to assess performance in settings where achieving a low false positive rate is more important than achieving a low false negative rate. We follow the procedure outlined in Sec 3.1 of the NIST 2018 Speaker Recognition Evaluation Plan, for the AfV trials. To avoid ambiguity, we mention here that we will use the following parameters: C_Miss = 1, C_FalseAlarm = 1, and P_Target = 0.05

Speaker Diarisation

For the Speaker Diarisation track, we will display both the Diarisation Error Rate (DER) and the Jaccard Error Rate (JER), but the leaderboard will be ranked using the Diarisation Error Rate (DER) only.

Diarisation Error Rate

The Diarisation Error Rate (DER) is the sum of
1. speaker error - percentage of scored time for which the wrong speaker id is assigned within a speech region.
2. false alarm speech - percentage of scored time for which a nonspeech region is incorrectly marked as containing speech
3. missed speech - percentage of scored time for which a speech region is incorrectly marked as not containing speech.

We use a collar of 0.25 seconds and include overlapping speech in the scoring. For more details, consult section 6.1 of the NIST RT-09 evaluation plan.

Jaccard Error Rate

We also report the Jaccard error rate (JER), a metric introduced for the DIHARD II challenge that is based on the Jaccard index. The Jaccard index is a similarity measure typically used to evaluate the output of image segmentation systems and is defined as the ratio between the intersection and union of two segmentations. To compute Jaccard error rate, an optimal mapping between reference and system speakers is determined and for each pair the Jaccard index of their segmentations is computed. The Jaccard error rate is then 1 minus the average of these scores. For more details please consult Sec 3 of the DIHARD Challenge Report.

Code for computing all metrics on the validation data has been provided in the development toolkit.

Challenge	Links
VoxSRC-19	challenge / workshop
VoxSRC-20	challenge / workshop
VoxSRC-21	challenge / workshop
VoxSRC-22	challenge / workshop

FAQs

Q. Who is allowed to participate?
A. Any researcher, whether in academia or industry, is invited to participate in VoxSRC . We only request a valid official email address, associated with an institution for registration, once the registration system opens. This ensures we limit the number of submissions per team.

Q: Do I need to use the name of my institution or my real name as the team name for a submission?
A: No, you do not have to. The name of the CodaLab user (or the Team name, if you have set up one in CodaLab) that uploads the submission will be used in the public leaderboard. Hence if you do not want your details to be public, you should anonymise if appropriate. You must select a team name before the server's closing time.

Q: For the semi-supervised track 3, can I use my model that I trained for the closed track 1?
A: Yes. For track 3, participants are allowed to train on the VoxCeleb2 dev dataset. So participants can use their model that was trained for track 1.

Q: For the semi-supervised track 3, can I use the CnCeleb validation set?
A: No. For track 3, participants can only use the provided validation set.

Q. Can I participate in only some tracks?
A. Yes, you can participate in as many tracks as you like and be considered for each one independantly.

Q: How many submissions can I make?
A: You can only make 1 submission per day. In total, you can make only 10 submissions to the test set for each track.

Q: Can I train on other external datasets (public, or not)?
A: Only for the OPEN tracks. Not for the CLOSED tracks.

Q: Can I use data augmentation?
A: Yes, you can use any kind of noise or music, as long as you are not training on additional speech data, for the CLOSED tracks. You may also use the MUSAN noise dataset as augmentation for the CLOSED tracks. For the OPEN track, you can train on any data you see fit.

Q. Can I participate in the challenge but not submit a report describing my method?
A. We do not allow that option. Entries to the challenge will only be considered if a technical report is submitted on time. This should not affect later publications of your method if you restrict your report to 2 pages including references. You can still submit to the leaderboard, however, even if you do not submit a technical report.

Q. Will the technical report submitted to this workshop be archived by Interspeech 2023?
A. No. We shall use the papers to select some authors to present their work at the workshop.

Q. Will there be prizes for the winners?
A. Yes, there will be cash prizes for the top 3 on the leaderboard for each track.

Q. For the CLOSED condition, can I use the validation set for training anything, eg. the PLDA parameters?
A. No, for the CLOSED condition you can use the validation set only to tune user-defined hyperparameters, eg. for example selecting which convolutional model to use.

Q. For the CLOSED conditions, what can I use as the validation set?
A. For the closed conditions, participants may only use the provided pairs for this year's challenge, or the VoxCeleb1 pairs. These must strictly NOT be used for training. It is beneficial for participants to use this year's provided validation pairs, as their distribution matches that of the hidden test pairs.

Q. What kind of supervision can I use when training without labels in the semi-supervised track?
A. Self-supervision is an increasingly popular field of machine learning which does not use manually labelled training data for a particular task. The supervision for training instead comes from the data itself, for example from the future frames of a video or from another modality, such as faces.

Q. For the semi-supervised track, when I am training on the large set of target domain data without labels, can I use the total number of speakers in the CnCeleb2 dev set as a hyperparameter?
A. No, you cannot use any speaker identity information at all. You cannot use the number of speakers in any way, e.g. to determine the number of clusters for a clustering algorithm.

Q. What if I have an additional question about the competition?
A. If you are registered in the CodaLab competition, please post your question in the competition forum (rather than contact the organisers directly by e-mail) and we will answer it as soon as possible. The reason for this approach is that others may have similar questions: use of the forum ensures that the question can be useful for everyone. If you rather make your question before registering, please follow the procedure in the Organisers section below.

~~May 20th~~	Development set for verification tracks released.
~~May 31st~~	Development set for diarisation tracks released.
~~June 9th~~	Test set released and evaluation server open.
~~August 8th~~	Deadline for submission of results.
~~August 15th~~	Deadline for technical report.
~~August 20th~~	Challenge workshop

The VoxCeleb Speaker Recognition Challenge 2023
(VoxSRC-23)

The workshop will be held in conjunction with Interspeech 2023.

All test labels are released now. Checkout our website.

Timeline

Tracks

#

Description

Track 1

Track 2

Track 3

Track 4

Data

Train set

Validation set

Verification tracks

Diarisation track

Test set

Verification tracks

Diarisation track

Evaluation Metrics

Speaker Verification

Equal Error Rate

Minimum Detection Cost

Speaker Diarisation

Diarisation Error Rate

Jaccard Error Rate

Challenge registration

Previous Challenges

Technical Description

FAQs

Organisers

Advisors

Acknowledgements

This work is supported by the EPSRC(Engineering and Physical Research Council) programme grant EP/T028572/1: Visual AI project.