VoxMM is an audio-visual dataset containing rich transcriptions of spoken conversations drawn from YouTube video clips spanning diverse domains. There is no overlap between the videos used to create the dev set and the test set. Statistics for dataset version 1.0.0 are given in the table below.
| Set | Utterance duration (min) | # videos | # speakers | # word instances | Vocabulary |
|---|---|---|---|---|---|
| Dev | 6,094 | 247 | 2,253 | 1.04M | 27k |
| Test | 462 | 42 | 374 | 84k | 6.5k |
One of the main features of the VoxMM dataset is its abundant metadata, which enables the creation of datasets for various tasks. Basic preprocessing code for the VoxMM dataset is provided in this GitHub repository.
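As a rough illustration of working with the metadata, the sketch below loads per-video metadata files and counts utterances per speaker. The file layout and field names used here (one JSON file per clip, `utterances`, `speaker_id`) are assumptions for illustration only, not the official VoxMM schema; please refer to the preprocessing code in the GitHub repository for the actual format.

```python
import json
from collections import Counter
from pathlib import Path

# Hypothetical layout: one JSON metadata file per video clip.
# Field names ("utterances", "speaker_id") are assumptions, not the
# official VoxMM schema.
def summarize_metadata(metadata_dir: str) -> Counter:
    """Count utterances per speaker across all metadata files."""
    utterances_per_speaker = Counter()
    for path in Path(metadata_dir).glob("*.json"):
        with open(path, encoding="utf-8") as f:
            meta = json.load(f)
        for utt in meta.get("utterances", []):
            utterances_per_speaker[utt["speaker_id"]] += 1
    return utterances_per_speaker

if __name__ == "__main__":
    counts = summarize_metadata("VoxMM/dev/metadata")
    print(counts.most_common(10))
```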
You can request the dataset here.
Please note that video files are not included in this dataset. If you require the video files, refer to the URL provided in the metadata.
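If you need the original videos, one possible approach is to fetch them from the URLs stored in the metadata with a downloader such as yt-dlp. The snippet below is a minimal sketch under the assumption that each metadata file exposes a `url` field; the actual field name may differ, so check the metadata schema before use.

```python
import json
from pathlib import Path

import yt_dlp  # third-party package: pip install yt-dlp

# The "url" field name is an assumption; consult the actual VoxMM
# metadata schema for the correct key.
def download_video(metadata_path: str, out_dir: str = "videos") -> None:
    """Fetch the source video referenced in a VoxMM metadata file."""
    with open(metadata_path, encoding="utf-8") as f:
        meta = json.load(f)
    opts = {
        "outtmpl": str(Path(out_dir) / "%(id)s.%(ext)s"),
        "format": "mp4/bestvideo+bestaudio/best",
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download([meta["url"]])
```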
The metadata of the VoxMM dataset is version controlled. The version log is given below.
v1.0.0: Initial release with the same videos as v0.0.0 but minor changes in the split. Improved label quality through additional manual refinement. Interjections and disfluencies are now distinguished in the script labels. Metadata structure revised for enhanced usability. Some attributes removed to address privacy concerns.
v0.0.0: Prototype version introduced in the ICASSP paper. Not released publicly.
The VoxMM dataset is available to download for research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the videos. A complete version of the license can be found here.
In order to collect videos that simulate in-the-wild scenarios from as many diverse domains as possible, the dataset includes sensitive content such as political debates and news. The views and opinions expressed by the speakers in the dataset are those of the individual speakers and do not necessarily reflect the positions of the Korea Advanced Institute of Science and Technology (KAIST) or the authors of the paper.
We would also like to note that the distribution of identities in this dataset may not be representative of the global human population. Please be mindful of unintended societal, gender, racial, linguistic, and other biases when training or deploying models on this data.
VoxMM: Rich transcription of conversations in the wild
D. Kwak, J. Jung, K. Nam, Y. Jang, J. Jung, S. Watanabe, J. S. Chung
International Conference on Acoustics, Speech and Signal Processing, 2024
PDF
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT, 2022-0-00989).