The dataset consists of thousands of spoken sentences from TED and TEDx videos. There is no overlap between the videos used to create the test set and the ones used for the pre-train and trainval sets. The dataset statistics are given in the table below.
| Set | # videos | # utterances | # word instances | Vocabulary size |
| Pre-train | 5,090 | 118,516 | 3.9M | 51k |
| Trainval | 4,004 | 31,982 | 358k | 17k |
| Test | 412 | 1,321 | 10k | 2k |
The Lip Reading Sentences 3 Languages (LRS3-Lang) dataset is an extended version of LRS3 (English-only) covering 13 different languages.
For every sample we provide: i) the URL ('ref' entry in the text file) and the frame ids of the original YouTube video it was created from, ii) the face detection bounding box for every frame, and iii) the word boundary timestamps (pre-train set only). The frame numbers provided assume that the video is sampled at 25 fps.
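Because the frame numbers assume 25 fps sampling, a frame id can be mapped to a time offset in the original video by dividing by 25. The following is a minimal sketch of that conversion, assuming zero-based frame ids (the indexing convention is not stated here); the function names are illustrative and are not part of any dataset tooling.

```python
FPS = 25  # frame rate assumed by the dataset's frame numbers


def frame_to_seconds(frame_id: int, fps: int = FPS) -> float:
    """Convert a frame id to a time offset in seconds.

    Assumes frame ids start at 0 (an assumption, not stated by the dataset).
    """
    return frame_id / fps


def seconds_to_frame(t: float, fps: int = FPS) -> int:
    """Convert a time offset in seconds to the nearest frame id."""
    return int(round(t * fps))


if __name__ == "__main__":
    # Example: where in the source video does frame 50 fall?
    print(frame_to_seconds(50))
    print(seconds_to_frame(2.0))
```

If the source video was encoded at a different frame rate, it would need to be resampled to 25 fps before these frame ids line up with the bounding boxes.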
Downloads are temporarily unavailable from this website until further notice. The LRS3 dataset is available to download for research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the video. A complete version of the license can be found here.
LRS3-TED: a large-scale dataset for visual speech recognition
T. Afouras, J. S. Chung, A. Zisserman
arXiv preprint arXiv:1809.00496