MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production
About
Sign language understanding has made significant strides; however, there is still no viable solution for generating sign sequences directly from entire spoken content, e.g., text or speech. In this paper, we propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users. In particular, a sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step. Moreover, by creating a joint embedding space for text, audio, and sign, we bind these modalities and leverage the semantic consistency among them to provide informative feedback for model training. This embedding-consistency learning strategy minimizes the reliance on sign triplets and enables continuous model refinement even when the audio modality is missing. Experiments on the How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in sign language production.
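To make the two core ideas concrete, below is a minimal, hypothetical sketch (not the authors' released code) of (1) a conditional denoising objective for a sign pose sequence driven by a text/speech embedding, and (2) a consistency loss that pulls text, audio, and sign embeddings together, with the audio term dropped when that modality is missing. All module and parameter names (`SignDenoiser`, `pose_dim`, `emb_dim`, the noise schedule, etc.) are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of diffusion-based sign production with
# embedding-consistency learning; names and hyperparameters are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SignDenoiser(nn.Module):
    """Predicts the clean pose sequence from a noisy one, conditioned on a
    spoken-content embedding (extracted from text or speech)."""

    def __init__(self, pose_dim=150, emb_dim=256, hidden=512):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim + emb_dim + 1, hidden)
        self.backbone = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.out_proj = nn.Linear(hidden, pose_dim)

    def forward(self, noisy_poses, cond_emb, t):
        # noisy_poses: (B, T, pose_dim); cond_emb: (B, emb_dim); t: (B,)
        B, T, _ = noisy_poses.shape
        cond = cond_emb.unsqueeze(1).expand(B, T, -1)
        t_feat = t.float().view(B, 1, 1).expand(B, T, 1) / 1000.0
        h = self.in_proj(torch.cat([noisy_poses, cond, t_feat], dim=-1))
        h, _ = self.backbone(h)
        return self.out_proj(h)  # predicted clean pose sequence


def diffusion_loss(model, poses, cond_emb, num_steps=1000):
    """DDPM-style objective: corrupt the pose sequence at a random step and
    ask the model to recover the clean sequence given the condition."""
    B = poses.size(0)
    t = torch.randint(0, num_steps, (B,), device=poses.device)
    alpha_bar = torch.cos(t.float() / num_steps * math.pi / 2) ** 2  # toy schedule
    alpha_bar = alpha_bar.view(B, 1, 1)
    noise = torch.randn_like(poses)
    noisy = alpha_bar.sqrt() * poses + (1 - alpha_bar).sqrt() * noise
    pred = model(noisy, cond_emb, t)
    return F.mse_loss(pred, poses)


def consistency_loss(text_emb, sign_emb, audio_emb=None, tau=0.07):
    """Contrastive consistency across modalities in a joint embedding space;
    when audio is unavailable, only the text<->sign term is used."""
    def nce(a, b):
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / tau
        labels = torch.arange(a.size(0), device=a.device)
        return F.cross_entropy(logits, labels)

    loss = nce(text_emb, sign_emb)
    if audio_emb is not None:
        loss = loss + nce(audio_emb, sign_emb) + nce(text_emb, audio_emb)
    return loss
```

In a training loop one would sum `diffusion_loss` and `consistency_loss`; because the consistency term degrades gracefully when `audio_emb` is `None`, the model can still be refined on text-sign pairs without full text-audio-sign triplets, which is the intent of the embedding-consistency strategy described above.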
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sign Language Translation | PHOENIX-2014T (test) | BLEU-4 | 12.77 | 159 |
| Sign Language Translation | How2Sign (test) | BLEU-4 | 4.26 | 61 |
| Sign Language Production | How2Sign | User Study Score | 2.65 | 5 |
| Sign Language Production | PHOENIX14T | User Study Score | 3.21 | 5 |