7,000+ speakers
VoxCeleb contains speech from speakers spanning a wide range of different ethnicities, accents, professions and ages.
1 million+ utterances
All speaking face-tracks are captured "in the wild", with background chatter, laughter, overlapping speech, pose variation and different lighting conditions.
2,000+ hours
VoxCeleb consists of both audio and video. Each segment is at least 3 seconds long.
We provide URLs for each YouTube video and timestamps for utterances. The frame number provided assumes that the video is saved at 25 fps.
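Because the frame numbers assume 25 fps, converting between a frame number and a time offset is simple arithmetic. The sketch below illustrates it; the function names are hypothetical, and it assumes frame numbers count from zero, which the page does not state.

```python
FPS = 25  # frame rate the published frame numbers assume


def frame_to_seconds(frame: int, fps: int = FPS) -> float:
    """Convert a frame number to a time offset in seconds.

    Assumes frame 0 corresponds to t = 0 (an assumption, not
    something the dataset page confirms).
    """
    if frame < 0:
        raise ValueError("frame must be non-negative")
    return frame / fps


def seconds_to_frame(seconds: float, fps: int = FPS) -> int:
    """Convert a time offset in seconds to the nearest frame number."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return round(seconds * fps)
```

Under these assumptions, the minimum segment length of 3 seconds spans 75 frames. If a video was saved at a different frame rate, the frame numbers no longer line up and should be rescaled through the time offset first.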
File | Link | MD5 Checksum
Dev | Download | 9c3b51e34038d1bdb2174dcc66543267 |
Test | Download | 8e06592a5f604e23e8cd10f421b36cc3 |
File | Link | MD5 Checksum
Dev | Download | 0e7a9f083c4efc27982f748f5f0b540a |
Test | Download | f305b5347c9c45362b7c838b561cea7d |
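The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib` module; the helper names and the file name in the usage note are hypothetical, not part of the dataset distribution.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Compare a file's MD5 digest with an expected hex string."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, `verify_md5("dev_download.zip", "9c3b51e34038d1bdb2174dcc66543267")` would return `True` only if the locally saved file (here given a placeholder name) matches the first Dev checksum listed above.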
You can request the audio-visual dataset here.
The VoxCeleb dataset is available to download for research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the video. A complete version of the license can be found here.
VoxCeleb: a large-scale speaker identification dataset
A. Nagrani*, J. S. Chung*, A. Zisserman
Interspeech, 2017
PDF
VoxCeleb2: Deep Speaker Recognition
J. S. Chung*, A. Nagrani*, A. Zisserman
Interspeech, 2018
PDF
VoxCeleb: Large-scale speaker verification in the wild
A. Nagrani*, J. S. Chung*, W. Xie, A. Zisserman
Computer Speech and Language, 2019
PDF