Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing
About
The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
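The alignment step described above can be illustrated with a minimal monotonic dynamic-programming sketch. This is not SEA's actual procedure (the paper's code should be consulted for that); it is a simplified, hypothetical version that assigns each segmented sign clip to one subtitle, requires subtitle indices to be non-decreasing over time, and maximizes total cosine similarity between sign and subtitle embeddings in the shared latent space. The function name and interface are illustrative assumptions.

```python
import numpy as np

def align_signs_to_subtitles(sign_embs: np.ndarray, sub_embs: np.ndarray) -> list:
    """Toy monotonic DP alignment (illustrative, not SEA's exact algorithm).

    Assigns each sign (row of sign_embs) to one subtitle (row of sub_embs)
    such that assigned subtitle indices never decrease, maximizing the sum
    of cosine similarities. Runs in O(num_signs * num_subs) time on CPU.
    """
    # Cosine similarity between every sign clip and every subtitle.
    s = sign_embs / np.linalg.norm(sign_embs, axis=1, keepdims=True)
    t = sub_embs / np.linalg.norm(sub_embs, axis=1, keepdims=True)
    sim = s @ t.T  # shape: (num_signs, num_subs)

    n, m = sim.shape
    dp = np.full((n, m), -np.inf)       # dp[i, j]: best score, sign i -> subtitle j
    back = np.zeros((n, m), dtype=int)  # backpointers for recovering the path
    dp[0] = sim[0]
    for i in range(1, n):
        # Monotonicity: sign i may use subtitle j only after some k <= j.
        best = np.maximum.accumulate(dp[i - 1])
        arg = np.zeros(m, dtype=int)    # argmax over k <= j of dp[i-1, k]
        for j in range(1, m):
            arg[j] = arg[j - 1] if dp[i - 1][arg[j - 1]] >= dp[i - 1][j] else j
        dp[i] = sim[i] + best
        back[i] = arg

    # Backtrack the highest-scoring monotonic assignment.
    j = int(np.argmax(dp[-1]))
    assignment = [j]
    for i in range(n - 1, 0, -1):
        j = int(back[i][j])
        assignment.append(j)
    return assignment[::-1]  # subtitle index per sign, in temporal order
```

Because the DP table is only `num_signs × num_subtitles` and each cell is filled in constant amortized time, this kind of procedure stays cheap even for long videos, consistent with the claim that alignment runs on CPUs within a minute for hour-long episodes.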
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sign language subtitle alignment | BOBSL (val) | F1@0.50 | 74.81 | 9 |
| Sign language subtitle alignment | BOBSL (test) | F1@0.50 | 65.68 | 9 |
| Sign language subtitle alignment | WMT-SLT (test) | F1@0.50 | 77.69 | 6 |
| Sign language subtitle alignment | WMT-SLT (val) | F1@0.50 | 75.34 | 6 |
| Sign language subtitle alignment | How2Sign (val) | F1@0.50 | 38.32 | 5 |
| Sign language subtitle alignment | How2Sign (test) | F1@0.50 | 39.57 | 5 |
| Sign language subtitle alignment | SwissSLi (val) | F1@0.50 | 71.86 | 4 |
| Sign language subtitle alignment | SwissSLi (test) | F1@0.50 | 85.57 | 4 |