
Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing

About

The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
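The paper's actual alignment procedure is not detailed in this abstract, but the idea of a lightweight dynamic programming alignment between sign embeddings and subtitle embeddings can be sketched. The snippet below is an illustrative assumption, not SEA's implementation: it uses cosine similarity and a monotonic DP assignment of segmented signs to subtitle lines, with all function and variable names hypothetical.

```python
import numpy as np

def align_subtitles(sign_embs, sub_embs):
    """Monotonically assign each segmented sign to a subtitle line via DP.

    Illustrative sketch only (not the SEA paper's exact procedure).
    sign_embs: (N, d) array, one embedding per segmented sign.
    sub_embs:  (M, d) array, one embedding per subtitle line.
    Returns a length-N array mapping each sign index to a subtitle index,
    non-decreasing in time (signs and subtitles both run in order).
    """
    # Cosine similarity between every sign and every subtitle.
    s = sign_embs / np.linalg.norm(sign_embs, axis=1, keepdims=True)
    t = sub_embs / np.linalg.norm(sub_embs, axis=1, keepdims=True)
    sim = s @ t.T  # (N, M)

    N, M = sim.shape
    # dp[i, j] = best total similarity when sign i is assigned to subtitle j,
    # under a monotonic (non-decreasing) assignment of earlier signs.
    dp = np.full((N, M), -np.inf)
    dp[0] = sim[0]
    back = np.zeros((N, M), dtype=int)
    for i in range(1, N):
        # Running max over j' <= j of dp[i-1, j'] keeps the loop O(N * M).
        best = np.maximum.accumulate(dp[i - 1])
        arg = np.zeros(M, dtype=int)
        for j in range(1, M):
            arg[j] = arg[j - 1] if dp[i - 1, arg[j - 1]] >= dp[i - 1, j] else j
        dp[i] = best + sim[i]
        back[i] = arg

    # Backtrace from the best final subtitle.
    assign = np.zeros(N, dtype=int)
    assign[-1] = int(np.argmax(dp[-1]))
    for i in range(N - 1, 0, -1):
        assign[i - 1] = back[i, assign[i]]
    return assign
```

Subtitle timestamps would then fall out of the assignment: each subtitle's start and end are the boundaries of its first and last assigned sign segment. A DP of this shape runs in O(N·M) time on a CPU, consistent with the abstract's claim of sub-minute alignment for hour-long episodes.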

Zifan Jiang, Youngjoon Jang, Liliane Momeni, Gül Varol, Sarah Ebling, Andrew Zisserman • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Sign language subtitle alignment | BOBSL (val) | F1@0.50 | 74.81 | 9 |
| Sign language subtitle alignment | BOBSL (test) | F1@0.50 | 65.68 | 9 |
| Sign language subtitle alignment | WMT-SLT (test) | F1@0.50 | 77.69 | 6 |
| Sign language subtitle alignment | WMT-SLT (val) | F1@0.50 | 75.34 | 6 |
| Sign language subtitle alignment | How2Sign (val) | F1@0.50 | 38.32 | 5 |
| Sign language subtitle alignment | How2Sign (test) | F1@0.50 | 39.57 | 5 |
| Sign language subtitle alignment | SwissSLi (val) | F1@0.50 | 71.86 | 4 |
| Sign language subtitle alignment | SwissSLi (test) | F1@0.50 | 85.57 | 4 |
