A new, challenging speaker recognition domain & dataset

VoxMovies is an audio dataset of utterances sourced from movies, with varying emotion, accents and background noise.

To benchmark the performance of speaker recognition systems on this entirely new domain, VoxMovies includes a number of domain adaptation evaluation sets.



VoxMovies contains speech from speakers in VoxCeleb1 and VoxCeleb2 (speaker recognition training datasets), so that domain change within the same identity can be investigated.



VoxMovies is sourced from key moments in a wide variety of movies from the Condensed Movies dataset. These movies cover many different genres such as comedy, action, romance and horror.



VoxMovies consists of audio clips. On average each identity has utterances from 2.7 different movies. Variation in emotion and background noise is therefore seen within each identity, as well as across identities.
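
As an illustration of how a per-identity statistic such as the average number of movies can be computed, here is a minimal sketch. It assumes clip metadata is available as (identity, movie) pairs; the pair layout, the function name and the toy values are hypothetical and are not taken from the dataset's actual file format.

```python
from collections import defaultdict


def mean_movies_per_identity(clips):
    """Average number of distinct movies per identity.

    `clips` is an iterable of (identity, movie) pairs, one per audio clip.
    This layout is an assumption made for illustration only.
    """
    movies_by_identity = defaultdict(set)
    for identity, movie in clips:
        # A set keeps each movie once, however many clips come from it.
        movies_by_identity[identity].add(movie)
    if not movies_by_identity:
        raise ValueError("no clips given")
    total_movies = sum(len(movies) for movies in movies_by_identity.values())
    return total_movies / len(movies_by_identity)


# Toy data (hypothetical): id_A appears in two movies, id_B in one.
toy_clips = [
    ("id_A", "movie_1"),
    ("id_A", "movie_2"),
    ("id_A", "movie_2"),
    ("id_B", "movie_3"),
]
print(mean_movies_per_identity(toy_clips))  # 1.5
```

Counting distinct movies rather than clips matters here: an identity with many clips from a single movie still contributes only one movie to the average.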


Movie genres featured in VoxMovies


Utterance lengths

Download and code

The dataset consists of a training and test partition, and several domain adaptation evaluation sets. Evaluation code can be found here.

File                    MD5 Checksum
WAV files (Download)    de90dc74430a1f65157d9b5505da7799
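
After downloading, the archive can be checked against the MD5 checksum listed for the WAV files. The following is a minimal sketch using Python's standard `hashlib` module; the archive filename `voxmovies_wav.zip` and the helper names are assumptions for illustration, so substitute the name of the file you actually downloaded.

```python
import hashlib
from pathlib import Path

# MD5 checksum listed on this page for the WAV files download.
EXPECTED_MD5 = "de90dc74430a1f65157d9b5505da7799"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("voxmovies_wav.zip")  # hypothetical filename
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

Reading in chunks keeps memory use flat for a large archive. A mismatch usually indicates an incomplete or corrupted download.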


The VoxMovies dataset is available to download for commercial/research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the video. A complete version of the license can be found here.

Caution: We note that the distribution of identities in the VoxMovies dataset may not be representative of the global human population. Please be careful of unintended societal, gender, racial and other biases when training or deploying models trained or evaluated on this data.

Please contact the authors below if you have any queries regarding the dataset.


Please cite the following if you make use of the dataset.

  • Playing a Part: Speaker Verification at the Movies
    A. Brown*, J. Huh*, A. Nagrani*, J. S. Chung, A. Zisserman
    International Conference on Acoustics, Speech and Signal Processing, 2021