Less is More: Label-Guided Summarization of Procedural and Instructional Videos

About

Video summarization turns long videos into clear, concise representations that are easier to review, document, and analyze, especially in high-stakes domains like surgical training. Prior work has progressed from basic visual features such as color, motion, and structural changes to pre-trained vision-language models that better capture what is happening in the video (semantics) and its temporal flow, resulting in more context-aware summarization. We propose a three-stage framework, PRISM (Procedural Representation via Integrated Semantic and Multimodal analysis), that produces semantically grounded video summaries. PRISM combines adaptive visual sampling, label-driven keyframe anchoring, and contextual validation using a large language model (LLM). Our method ensures that selected frames reflect meaningful procedural transitions while filtering out generic or hallucinated content, resulting in contextually coherent summaries across both domain-specific and instructional videos. We evaluate our method on instructional and activity datasets, using reference summaries for instructional videos. Despite sampling fewer than 5% of the original frames, our summaries retain 84% of the semantic content while improving over baselines by as much as 33%. Our approach generalizes across procedural and domain-specific video tasks, achieving strong performance in both semantic alignment and precision.
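The abstract names three stages but gives no implementation details, so the following is only a rough sketch of how such a pipeline might be wired together, not the authors' method. Everything in it is an assumption made for illustration: the function names, the mean-absolute-difference sampling rule, the threshold value, the toy feature vectors and labels, and the set-membership check that stands in for LLM validation.

```python
from typing import Callable, List, Sequence, Tuple


def adaptive_sample(features: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Stage 1 (assumed rule): keep a frame only when its features differ
    from the last kept frame by more than `threshold` on average."""
    if not features:
        return []
    kept = [0]  # always keep the first frame
    for i in range(1, len(features)):
        last = features[kept[-1]]
        dist = sum(abs(a - b) for a, b in zip(features[i], last)) / len(last)
        if dist > threshold:
            kept.append(i)
    return kept


def anchor_by_label(indices: Sequence[int], labels: Sequence[str]) -> List[Tuple[int, str]]:
    """Stage 2 (assumed rule): keep the first sampled frame of each run of
    identical labels, so every anchor marks a label transition."""
    anchors: List[Tuple[int, str]] = []
    prev = None
    for i in indices:
        if labels[i] != prev:
            anchors.append((i, labels[i]))
            prev = labels[i]
    return anchors


def validate(anchors: Sequence[Tuple[int, str]],
             is_valid: Callable[[str], bool]) -> List[Tuple[int, str]]:
    """Stage 3 (placeholder): drop anchors whose label fails a validity check.
    In the paper this role is played by an LLM; here it is any callable."""
    return [(i, label) for i, label in anchors if is_valid(label)]


def summarize(features, labels, threshold, is_valid):
    """Chain the three hypothetical stages."""
    sampled = adaptive_sample(features, threshold)
    return validate(anchor_by_label(sampled, labels), is_valid)


# Toy data: one-dimensional "features" and hand-written per-frame labels.
features = [[0.0], [0.01], [0.5], [0.52], [1.0], [1.01], [1.5], [1.51]]
labels = ["prep", "prep", "cut", "cut", "cut", "cut", "scene", "scene"]
generic = {"scene"}  # labels treated as too generic to keep (assumption)

summary = summarize(features, labels, threshold=0.1,
                    is_valid=lambda label: label not in generic)
print(summary)
```

On this toy input, sampling keeps frames 0, 2, 4, and 6; anchoring collapses the repeated "cut" label to its first frame; and the placeholder validator drops the generic "scene" label. A real system would replace the toy features with visual embeddings and the validator with a model call.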

Shreya Rajpal, Michal Golovanevsky, Carsten Eickhoff • 2026

Related benchmarks

Task	Dataset	Result
Video Captioning	ActivityNet Captions (val)	METEOR 20.04	22
Video Level Summarization	YouCook2	METEOR 30.08	21
Video Summarization	Cholec80	--	3
Video Summarization	PIT-VIS	--	3

