Self-Sufficient Framework for Continuous Sign Language Recognition
About
The goal of this work is to develop a self-sufficient framework for Continuous Sign Language Recognition (CSLR) that addresses two key issues of sign language recognition: the need for complex multi-scale features, such as the hands, face, and mouth, to understand signing, and the absence of frame-level annotations. To this end, we propose (1) Divide and Focus Convolution (DFConv), which extracts both manual and non-manual features without the need for additional networks or annotations, and (2) Dense Pseudo-Label Refinement (DPLR), which propagates non-spiky frame-level pseudo-labels by combining the ground-truth gloss sequence labels with the predicted sequence. We demonstrate that our model achieves state-of-the-art performance among RGB-based methods on the large-scale CSLR benchmarks PHOENIX-2014 and PHOENIX-2014-T, and that it achieves comparable results with better efficiency than approaches that use multiple modalities or extra annotations.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Continuous Sign Language Recognition | PHOENIX14-T (dev) | WER 20.5 | 75 |
| Continuous Sign Language Recognition | PHOENIX-2014T (test) | WER 22.3 | 43 |
| Continuous Sign Language Recognition | Phoenix14 (test) | WER 20.7 | 39 |
| Continuous Sign Language Recognition | Phoenix14 (dev) | WER 20.9 | 29 |
| Continuous Sign Language Recognition | PHOENIX 14 (dev test) | WER (Dev) 20.9 | 16 |
| Continuous Sign Language Recognition | PHOENIX14-T (dev test) | WER (Dev) 20.5 | 14 |
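
The results above are reported as word error rate (WER), where lower is better. As a rough illustration of the metric, the sketch below computes WER for a single sentence as the word-level edit distance (substitutions, deletions, insertions) between a predicted gloss sequence and the reference, divided by the reference length. The function name `word_error_rate` and the placeholder glosses are hypothetical and not taken from the paper; benchmark scores are normally aggregated over the whole evaluation set with the official evaluation tooling, so this is not a reproduction of the reported numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent for one whitespace-tokenized sentence pair."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference tokens
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference token missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis token)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # Placeholder glosses: one substitution (B -> X) and one deletion (D)
    # against a four-token reference.
    print(word_error_rate("A B C D", "A X C"))  # 50.0
```

Because the denominator is the reference length, WER can exceed 100 when the hypothesis contains many insertions.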