VoxCeleb

7,000+ speakers

VoxCeleb contains speech from speakers spanning a wide range of ethnicities, accents, professions and ages.

1 million+ utterances

All speaking face-tracks are captured "in the wild", with background chatter, laughter, overlapping speech, pose variation and different lighting conditions.

2,000+ hours

VoxCeleb consists of both audio and video. Each segment is at least 3 seconds long.

Downloads

URLs and timestamps

We provide URLs for each YouTube video and timestamps for utterances. The frame numbers provided assume that the video is saved at 25 fps.
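Because the frame numbers assume 25 fps, converting one to a time offset is a division by the frame rate. The following is a minimal sketch, assuming frame indices count from zero; the function names are illustrative and not part of any VoxCeleb tooling.

```python
FPS = 25  # frame rate the provided frame numbers assume


def frame_to_seconds(frame, fps=FPS):
    """Convert a frame index to a time offset in seconds.

    Assumption: frame 0 corresponds to time 0.0 in the saved video.
    """
    if frame < 0:
        raise ValueError("frame index must be non-negative")
    return frame / fps


def span_in_seconds(start_frame, end_frame, fps=FPS):
    """Return (start, end) offsets in seconds for a pair of frame indices."""
    if end_frame < start_frame:
        raise ValueError("end_frame must not precede start_frame")
    return frame_to_seconds(start_frame, fps), frame_to_seconds(end_frame, fps)
```

If a video was saved at a different frame rate, the frame numbers would need to be rescaled before use; passing another `fps` value here only changes the arithmetic, it does not resample anything.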

VoxCeleb1

File		MD5 Checksum
Dev	Download	`9c3b51e34038d1bdb2174dcc66543267`
Test	Download	`8e06592a5f604e23e8cd10f421b36cc3`

VoxCeleb2

File		MD5 Checksum
Dev	Download	`0e7a9f083c4efc27982f748f5f0b540a`
Test	Download	`f305b5347c9c45362b7c838b561cea7d`
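The MD5 checksums listed above can be used to confirm that a download completed without corruption. A minimal sketch using Python's standard `hashlib` module follows; the helper names are illustrative, and any local file path passed to them is the reader's own choice rather than a name given on this page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Compare a file's MD5 digest with an expected hex string."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, `verify_md5(local_path, "9c3b51e34038d1bdb2174dcc66543267")` would be expected to return `True` for an intact copy of the VoxCeleb1 Dev file, where `local_path` is wherever the file was saved.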

Audio and video files

You can request the audio-visual dataset here.

Trial pairs for speaker verification

List of trial pairs - VoxCeleb1
List of trial pairs - VoxCeleb1 (cleaned)
List of trial pairs - VoxCeleb1-H
List of trial pairs - VoxCeleb1-H (cleaned)
List of trial pairs - VoxCeleb1-E
List of trial pairs - VoxCeleb1-E (cleaned)

The VoxCeleb1-E and VoxCeleb1-H lists are drawn from the VoxCeleb1 training set. Therefore, you cannot use any files in VoxCeleb1 for training if you are using these lists for testing.
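A quick sanity check for this kind of train/test overlap can be scripted. The sketch below rests on two assumptions that this page does not state: that each trial line has the form `label path1 path2`, and that each path begins with a speaker ID directory. All function names are hypothetical. It flags training files whose speaker also appears in a trial list; it is not a substitute for the rule above.

```python
def speaker_of(path):
    """Return the leading directory of a path, assumed to be the speaker ID."""
    return path.split("/")[0]


def trial_speakers(trial_lines):
    """Collect the speaker IDs that occur in a trial list.

    Assumption: each non-empty line reads 'label path1 path2'.
    """
    speakers = set()
    for line in trial_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"unexpected trial line: {line!r}")
        _, first, second = fields
        speakers.add(speaker_of(first))
        speakers.add(speaker_of(second))
    return speakers


def overlapping_training_files(training_files, trial_lines):
    """Return, sorted, the training files whose speaker is in the trial list."""
    held_out = trial_speakers(trial_lines)
    return sorted(f for f in training_files if speaker_of(f) in held_out)
```

An empty result from `overlapping_training_files` means no overlap was found under these assumptions; a non-empty result lists the files to drop from training.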

Publications

Please cite the following if you make use of the dataset.

VoxCeleb: a large-scale speaker identification dataset
A. Nagrani*, J. S. Chung*, A. Zisserman
Interspeech, 2017
PDF

VoxCeleb2: Deep Speaker Recognition
J. S. Chung*, A. Nagrani*, A. Zisserman
Interspeech, 2018
PDF

VoxCeleb: Large-scale speaker verification in the wild
A. Nagrani*, J. S. Chung*, W. Xie, A. Zisserman
Computer Speech and Language, 2019
PDF