VoxMM is an audio-visual dataset containing rich transcriptions of spoken conversations drawn from YouTube video clips spanning diverse domains. There is no overlap between the videos used to create the dev set and the test set. Statistics for dataset version 1.0.0 are given in the table below.
| Set | Utterance duration (min) | # videos | # speakers | # word instances | Vocabulary |
|---|---|---|---|---|---|
| Dev | 6,094 | 247 | 2,253 | 1.04M | 27k |
| Test | 462 | 42 | 374 | 84k | 6.5k |
One of the main features of the VoxMM dataset is its abundant metadata, which enables the creation of datasets for various tasks. Basic preprocessing code for the VoxMM dataset is provided in this GitHub repository.
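As a rough illustration of working with the metadata, the sketch below loads per-video metadata files and counts utterances per speaker. The file layout and field names used here (one JSON file per clip, `utterances`, `speaker_id`) are assumptions for illustration only, not the official VoxMM schema; please refer to the preprocessing code in the GitHub repository for the actual format.

```python
import json
from collections import Counter
from pathlib import Path

# Hypothetical layout: one JSON metadata file per video clip.
# Field names ("utterances", "speaker_id") are assumptions, not the
# official VoxMM schema.
def summarize_metadata(metadata_dir: str) -> Counter:
    """Count utterances per speaker across all metadata files."""
    utterances_per_speaker = Counter()
    for path in Path(metadata_dir).glob("*.json"):
        with open(path, encoding="utf-8") as f:
            meta = json.load(f)
        for utt in meta.get("utterances", []):
            utterances_per_speaker[utt["speaker_id"]] += 1
    return utterances_per_speaker

if __name__ == "__main__":
    counts = summarize_metadata("VoxMM/dev/metadata")
    print(counts.most_common(10))
```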
You can request the dataset here.
Please note that video files are not included in this dataset. If you require the video files, refer to the URL provided in the metadata.
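If you need the original videos, one possible approach is to fetch them from the URLs stored in the metadata with a downloader such as yt-dlp. The snippet below is a minimal sketch under the assumption that each metadata file exposes a `url` field; the actual field name may differ, so check the metadata schema before use.

```python
import json
from pathlib import Path

import yt_dlp  # third-party package: pip install yt-dlp

# The "url" field name is an assumption; consult the actual VoxMM
# metadata schema for the correct key.
def download_video(metadata_path: str, out_dir: str = "videos") -> None:
    """Fetch the source video referenced in a VoxMM metadata file."""
    with open(metadata_path, encoding="utf-8") as f:
        meta = json.load(f)
    opts = {
        "outtmpl": str(Path(out_dir) / "%(id)s.%(ext)s"),
        "format": "mp4/bestvideo+bestaudio/best",
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download([meta["url"]])
```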
The metadata of the VoxMM dataset is version controlled. The version log is given below.
v1.0.0: Initial release with the same videos as v0.0.0 but minor changes in the split. Improved label quality through additional manual refinement. Interjections and disfluencies are now distinguished in the script labels. Metadata structure revised for enhanced usability. Some attributes removed to address privacy concerns.
v0.0.0: Prototype version introduced in the ICASSP paper. Not released publicly.
The VoxMM dataset is available to download for research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the videos. A complete version of the license can be found here.
In order to collect videos that simulate in-the-wild scenarios from as many diverse domains as possible, the dataset includes sensitive content such as political debates and news. The views and opinions expressed by the speakers in the dataset are those of the individual speakers and do not necessarily reflect the positions of the Korea Advanced Institute of Science and Technology (KAIST) or the authors of the paper.
We would also like to note that the distribution of identities in this dataset may not be representative of the global human population. Please be mindful of unintended societal, gender, racial, linguistic, and other biases when training or deploying models on this data.
VoxMM: Rich transcription of conversations in the wild
D. Kwak, J. Jung, K. Nam, Y. Jang, J. Jung, S. Watanabe, J. S. Chung
International Conference on Acoustics, Speech and Signal Processing, 2024
PDF
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT, 2022-0-00989).