The goal of this work is to devise a self-sufficient framework for Continuous Sign Language Recognition (CSLR) that addresses two key issues of sign language recognition: (1) the need for complex cues, such as the hands, face, and mouth, for understanding and (2) the absence of frame-level annotations.
To this end, we propose (1) Divide and Focus Convolution (DFConv) which extracts both manual and non-manual features without additional networks or annotations and (2) Dense Pseudo-Label Refinement (DPLR) which propagates non-spiky frame-level pseudo-labels by combining the ground truth gloss sequence label with the predicted sequence.
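To make the division concrete, the sketch below illustrates one way a DFConv-style layer could be realized in PyTorch: the feature map is split at a height ratio r, the upper (non-manual) and lower (manual) regions are processed by separate convolution branches, and the outputs are re-concatenated along the height dimension. The class name, branch design, and default ratio are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn

class DFConvSketch(nn.Module):
    """Illustrative sketch: split the feature map at height ratio r,
    process the upper (non-manual) and lower (manual) regions with
    separate convolution branches, and re-concatenate along height.
    The branch design and ratio handling are assumptions, not the
    exact DFConv implementation."""

    def __init__(self, in_channels: int, out_channels: int, r: float = 0.4):
        super().__init__()
        self.r = r  # fraction of the height treated as the upper (non-manual) region
        self.upper_conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.lower_conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W); split index derived from the division ratio r
        h_split = int(x.size(2) * self.r)
        upper, lower = x[:, :, :h_split], x[:, :, h_split:]
        return torch.cat([self.upper_conv(upper), self.lower_conv(lower)], dim=2)

# Example: a batch of 224x224 frames, upper 40% assumed to cover face/mouth
feats = DFConvSketch(3, 64, r=0.4)(torch.randn(2, 3, 224, 224))
print(feats.shape)  # torch.Size([2, 64, 224, 224])
```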
We experimentally demonstrate that our model achieves state-of-the-art performance among RGB-based methods on the large-scale CSLR benchmarks PHOENIX-2014 and PHOENIX-2014-T, while showing comparable results with better efficiency than approaches that use multi-modality or extra annotations.
We present additional experimental results to support our framework's novelty.
We demonstrate our framework's robustness in a real-world scenario by changing scale and translation at inference time. Furthermore, we show failure cases of the pose detector in STMC [1] when the same transformations (scale and translation) are applied. Note that our framework requires only the RGB modality.
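As a reference for how such inference-time perturbations can be applied, the following sketch uses torchvision to rescale and translate every frame of a clip; the helper name and the specific scale/translation values are illustrative assumptions rather than the exact settings of our study.

```python
import torch
import torchvision.transforms.functional as TF

def perturb_frames(frames: torch.Tensor, scale: float = 0.8, dx: int = 20, dy: int = 0) -> torch.Tensor:
    """Apply a fixed scale and translation to every frame of a clip,
    a simple stand-in for the inference-time perturbations used in the
    robustness study (ranges here are illustrative assumptions).
    frames: (T, C, H, W) tensor."""
    return torch.stack([
        TF.affine(f, angle=0.0, translate=[dx, dy], scale=scale, shear=[0.0])
        for f in frames
    ])
```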
We compare computational complexity with the most recent multi-cue-based method, STMC. Even though the pose detector of STMC is lightweight, it still introduces a bottleneck at inference time. We highlight that DFConv significantly reduces both FLOPs and inference time by removing the pose estimator. For reference, in our environment, extracting human keypoints with HRNet [2] from the PHOENIX-2014 dataset [3] takes 2-3 GPU days. Note that we re-implement STMC because its code is not publicly available.
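For transparency, the snippet below sketches the kind of latency-measurement protocol such a comparison can follow (warm-up iterations, GPU synchronization, averaged timing); the helper name and run counts are illustrative assumptions, not the exact protocol behind our reported numbers.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model: torch.nn.Module, clip: torch.Tensor, warmup: int = 10, runs: int = 50) -> float:
    """Rough per-clip inference latency in seconds (illustrative protocol)."""
    model.eval()
    for _ in range(warmup):          # warm up caches / CUDA kernels
        model(clip)
    if torch.cuda.is_available():
        torch.cuda.synchronize()     # make sure queued GPU work is finished
    start = time.perf_counter()
    for _ in range(runs):
        model(clip)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs
```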
To demonstrate the wide applicability of DPLR, we compare it with other CSLR approaches that use pseudo-labeling.
We visualize more qualitative examples.
To demonstrate the effectiveness of DPLR, we show additional qualitative results of gloss-level sequence predictions. Note that no extra network is required to generate the Dense Pseudo-Labels (DPL), and since the classifier for DPLR is auxiliary, it does not affect inference time, which is an important factor for real-time operation.
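To give an intuition for how dense pseudo-labels can be obtained from spiky frame-level predictions and the ground-truth gloss sequence, the sketch below snaps each non-blank predicted frame onto the ground-truth gloss sequence monotonically and propagates labels to the remaining frames. This is a simplified reading of the refinement under our own assumptions, not the exact DPLR procedure.

```python
import numpy as np

def dense_pseudo_labels(frame_preds, gt_glosses, blank: int = 0) -> np.ndarray:
    """Illustrative sketch: turn spiky per-frame argmax gloss predictions
    into dense frame-level pseudo-labels using the ground-truth gloss
    sequence. The monotonic matching and forward propagation below are
    assumptions, not the exact DPLR algorithm.
    frame_preds: per-frame predicted gloss ids (mostly `blank`).
    gt_glosses: ordered ground-truth gloss ids for the video."""
    labels = np.full(len(frame_preds), -1, dtype=np.int64)
    g = 0  # pointer into the ground-truth gloss sequence (monotonic)
    for t, p in enumerate(frame_preds):
        if p == blank:
            continue
        # advance the pointer when the prediction matches the next GT gloss
        while g + 1 < len(gt_glosses) and gt_glosses[g] != p and gt_glosses[g + 1] == p:
            g += 1
        labels[t] = gt_glosses[g]  # snap the frame to the current GT gloss
    # propagate: fill unlabeled frames from the nearest preceding label
    last = gt_glosses[0]
    for t in range(len(labels)):
        if labels[t] == -1:
            labels[t] = last
        else:
            last = labels[t]
    return labels

# Example: spiky CTC-style predictions become a dense, non-spiky label track
print(dense_pseudo_labels([0, 0, 5, 0, 0, 7, 0, 9, 0, 0], [5, 7, 9]))
# [5 5 5 5 5 7 7 9 9 9]
```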
Our proposed solution directly contributes to the development of sign translation systems by providing high-quality, multi-cue-aware visual features to modern sign translation models. Advanced sign interpretation technologies could help socially marginalized deaf people and improve accessibility in social infrastructures such as education and health, which hearing people take for granted. However, the currently available large-scale PHOENIX/T benchmarks are sourced from a specific domain (e.g., weather forecasts) and could therefore bias the model toward a particular scenario in language and visual appearance, leading to potential miscommunication that could affect the lives of deaf people.
Although we have shown that both non-manual and manual expressions are simultaneously captured from a sign video, a limitation of our work stems from the assumption that non-manual expressions occur in the upper region of a frame and manual expressions in the lower region. While we address this issue by allowing the division ratio r to be adapted at test time, practical scenarios may introduce not only positional shifts but also variations in the signer's scale (e.g., due to distance from the camera). Future CSLR work should embrace such practical challenges so that recognition systems can be deployed in the real world with ease.
[1] Zhou, Hao, et al. "Spatial-temporal multi-cue network for continuous sign language recognition." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 07. 2020.
[2] Sun, Ke, et al. "Deep high-resolution representation learning for human pose estimation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
[3] Koller, Oscar, Jens Forster, and Hermann Ney. "Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers." Computer Vision and Image Understanding 141 (2015): 108-125.
[4] Cheng, Ka Leong, et al. "Fully convolutional networks for continuous sign language recognition." European Conference on Computer Vision. Springer, Cham, 2020.
[5] Min, Yuecong, et al. "Visual alignment constraint for continuous sign language recognition." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
[6] Selvaraju, Ramprasaath R., et al. "Grad-cam: Visual explanations from deep networks via gradient-based localization." Proceedings of the IEEE international conference on computer vision. 2017.
[7] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).