Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing
About
The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
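The alignment step described above can be illustrated with a minimal monotonic dynamic-programming sketch. This is not SEA's actual procedure (the paper's code should be consulted for that); it is a simplified, hypothetical version that assigns each segmented sign clip to one subtitle, requires subtitle indices to be non-decreasing over time, and maximizes total cosine similarity between sign and subtitle embeddings in the shared latent space. The function name and interface are illustrative assumptions.

```python
import numpy as np

def align_signs_to_subtitles(sign_embs: np.ndarray, sub_embs: np.ndarray) -> list:
    """Toy monotonic DP alignment (illustrative, not SEA's exact algorithm).

    Assigns each sign (row of sign_embs) to one subtitle (row of sub_embs)
    such that assigned subtitle indices never decrease, maximizing the sum
    of cosine similarities. Runs in O(num_signs * num_subs) time on CPU.
    """
    # Cosine similarity between every sign clip and every subtitle.
    s = sign_embs / np.linalg.norm(sign_embs, axis=1, keepdims=True)
    t = sub_embs / np.linalg.norm(sub_embs, axis=1, keepdims=True)
    sim = s @ t.T  # shape: (num_signs, num_subs)

    n, m = sim.shape
    dp = np.full((n, m), -np.inf)       # dp[i, j]: best score, sign i -> subtitle j
    back = np.zeros((n, m), dtype=int)  # backpointers for recovering the path
    dp[0] = sim[0]
    for i in range(1, n):
        # Monotonicity: sign i may use subtitle j only after some k <= j.
        best = np.maximum.accumulate(dp[i - 1])
        arg = np.zeros(m, dtype=int)    # argmax over k <= j of dp[i-1, k]
        for j in range(1, m):
            arg[j] = arg[j - 1] if dp[i - 1][arg[j - 1]] >= dp[i - 1][j] else j
        dp[i] = sim[i] + best
        back[i] = arg

    # Backtrack the highest-scoring monotonic assignment.
    j = int(np.argmax(dp[-1]))
    assignment = [j]
    for i in range(n - 1, 0, -1):
        j = int(back[i][j])
        assignment.append(j)
    return assignment[::-1]  # subtitle index per sign, in temporal order
```

Because the DP table is only `num_signs × num_subtitles` and each cell is filled in constant amortized time, this kind of procedure stays cheap even for long videos, consistent with the claim that alignment runs on CPUs within a minute for hour-long episodes.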
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sign language subtitle alignment | BOBSL (val) | F1@0.50 | 74.81 | 9 |
| Sign language subtitle alignment | BOBSL (test) | F1@0.50 | 65.68 | 9 |
| Sign language subtitle alignment | WMT-SLT (test) | F1@0.50 | 77.69 | 6 |
| Sign language subtitle alignment | WMT-SLT (val) | F1@0.50 | 75.34 | 6 |
| Sign language subtitle alignment | How2Sign (val) | F1@0.50 | 38.32 | 5 |
| Sign language subtitle alignment | How2Sign (test) | F1@0.50 | 39.57 | 5 |
| Sign language subtitle alignment | SwissSLi (val) | F1@0.50 | 71.86 | 4 |
| Sign language subtitle alignment | SwissSLi (test) | F1@0.50 | 85.57 | 4 |