
USTM: Unified Spatial and Temporal Modeling for Continuous Sign Language Recognition

About

Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques often fail to capture fine-grained hand and facial cues and to model long-range temporal dependencies. To address these limitations, we propose the Unified Spatio-Temporal Modeling (USTM) framework, a spatio-temporal encoder that effectively models complex patterns by combining a Swin Transformer backbone with a lightweight temporal adapter with positional embeddings (TAPE). Our framework captures fine-grained spatial features alongside short- and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities. Extensive experiments on benchmark datasets, including PHOENIX14, PHOENIX14T, and CSL-Daily, demonstrate that USTM achieves state-of-the-art performance against RGB-based as well as multi-modal CSLR approaches, while remaining competitive with multi-stream approaches. These results highlight the strength and efficacy of the USTM framework for CSLR. The code is available at https://github.com/gufranSabri/USTM
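The abstract does not detail the TAPE module, but adapter designs of this kind typically follow a down-project / temporal-mix / up-project pattern with a residual connection. The NumPy sketch below illustrates one plausible form under that assumption: all function names, shapes, and the choice of a depthwise temporal convolution with additive positional embeddings are illustrative, not the authors' implementation.

```python
import numpy as np

def temporal_adapter(x, w_down, w_up, pos_emb, kernel):
    """Hypothetical sketch of a lightweight temporal adapter with
    positional embeddings (TAPE-like), applied to per-frame features.

    x:       (T, D) frame features from the spatial backbone
    w_down:  (D, d) down-projection into a small bottleneck
    w_up:    (d, D) up-projection back to the backbone width
    pos_emb: (T, d) learnable temporal positional embeddings
    kernel:  (k, d) depthwise 1-D temporal convolution weights
    Returns: (T, D) features with a residual temporal refinement.
    """
    h = x @ w_down + pos_emb              # project down, add temporal positions
    k = kernel.shape[0]
    pad = k // 2
    hp = np.pad(h, ((pad, pad), (0, 0)))  # zero-pad along the time axis
    conv = np.zeros_like(h)
    for t in range(h.shape[0]):           # depthwise conv over time, per channel
        conv[t] = (hp[t:t + k] * kernel).sum(axis=0)
    conv = np.maximum(conv, 0.0)          # ReLU nonlinearity
    return x + conv @ w_up                # residual connection to the backbone

# Toy usage: 8 frames, 16-dim features, 4-dim bottleneck, kernel size 3.
rng = np.random.default_rng(0)
T, D, d = 8, 16, 4
out = temporal_adapter(
    rng.normal(size=(T, D)),
    rng.normal(size=(D, d)) * 0.1,
    rng.normal(size=(d, D)) * 0.1,
    rng.normal(size=(T, d)) * 0.1,
    rng.normal(size=(3, d)) * 0.1,
)
print(out.shape)  # (8, 16): output keeps the backbone feature shape
```

Because the adapter is residual and narrow (d ≪ D), it adds short- and long-range temporal mixing with few extra parameters, which is the usual motivation for inserting such modules into a frozen or lightly tuned spatial backbone.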

Ahmed Abul Hasanaath, Hamzah Luqman • 2025

Related benchmarks

Task                                  Dataset            Metric  Result  Rank
Continuous Sign Language Recognition  CSL-Daily (dev)    WER     27.7    98
Continuous Sign Language Recognition  CSL-Daily (test)   WER     26.4    91
Continuous Sign Language Recognition  PHOENIX14T (dev)   WER     17.4    75
Continuous Sign Language Recognition  PHOENIX14T (test)  WER     18.9    43
Continuous Sign Language Recognition  PHOENIX14 (test)   WER     17.6    39
Continuous Sign Language Recognition  PHOENIX14 (dev)    WER     17.4    29
