
SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction

About

Sign language processing has traditionally relied on task-specific models, limiting the potential for transfer learning across tasks. Pre-training methods for sign language have typically focused on either supervised pre-training, which cannot take advantage of unlabeled data, or context-independent (frame or video segment) representations, which ignore the effects of relationships across time in sign language. We introduce SHuBERT (Sign Hidden-Unit BERT), a self-supervised contextual representation model learned from approximately 1,000 hours of American Sign Language video. SHuBERT adapts masked token prediction objectives to multi-stream visual sign language input, learning to predict multiple targets corresponding to clustered hand, face, and body pose streams. SHuBERT achieves state-of-the-art performance across multiple tasks including sign language translation, isolated sign language recognition, and fingerspelling detection.
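To make the training objective above concrete, here is a minimal sketch of HuBERT-style multi-stream masked cluster prediction. All names, sizes, and the random "model outputs" are hypothetical stand-ins: in the actual pipeline the centroids would come from clustering hand, face, and body pose features, and the logits from a transformer over the masked video input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): T frames, D features per
# stream, K cluster targets per stream.
STREAMS = ["left_hand", "right_hand", "face", "body"]
T, D, K = 50, 16, 8

def cluster_targets(feats, centroids):
    """Assign each frame to its nearest centroid (HuBERT-style pseudo-label)."""
    d = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (T, K)
    return d.argmin(-1)  # (T,)

# One pseudo-labeled example: per-stream features and (assumed) k-means centroids.
feats = {s: rng.normal(size=(T, D)) for s in STREAMS}
cents = {s: rng.normal(size=(K, D)) for s in STREAMS}
targets = {s: cluster_targets(feats[s], cents[s]) for s in STREAMS}

# Mask a random subset of frames; the model must predict the cluster IDs
# of *all* streams at the masked positions (multi-target prediction).
mask = rng.random(T) < 0.4

def masked_ce(logits, tgt, mask):
    """Cross-entropy over masked frames only."""
    p = np.exp(logits - logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    nll = -np.log(p[np.arange(len(tgt)), tgt] + 1e-9)
    return nll[mask].mean()

# Stand-in for the transformer's per-stream prediction heads.
logits = {s: rng.normal(size=(T, K)) for s in STREAMS}
loss = sum(masked_ce(logits[s], targets[s], mask) for s in STREAMS)
print(f"total multi-stream masked loss: {loss:.3f}")
```

The key design point the sketch illustrates is that a single shared encoder gets one loss term per stream, so each masked frame supervises the model with several cluster targets at once.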

Shester Gueuwou, Xiaodan Du, Greg Shakhnarovich, Karen Livescu, Alexander H. Liu • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Sign Language Translation | How2Sign (test) | - | 61 |
| Sign Language Translation | OpenASL (test) | - | 26 |
| Isolated Sign Language Recognition | WLASL 2000 (test) | P-I: 60.9 | 6 |
| Isolated Sign Language Recognition | ASL Citizen (test) | Rec@1: 65 | 4 |
| Sign Language Translation | FLEURS ASL (test) | BLEU: 4.7 | 4 |
| Isolated Sign Language Recognition | Sem-Lex (test) | Rec@1: 0.54 | 3 |
| Fingerspelling detection | ASL-Stem Wiki (cross-validation) | mIoU: 40 | 2 |

Other info

Code
