VoxConverse is an audio-visual diarisation dataset consisting of over 50 hours of multi-speaker clips of human speech, extracted from YouTube videos.
Spot the conversation: speaker diarisation in the wild
J. S. Chung*, J. Huh*, A. Nagrani*, T. Afouras, A. Zisserman
Interspeech, 2020