CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

About

Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods use keyframe sampling, which can miss both macro-level events and micro-level details because of its sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to $86\%$ and token usage by up to $93\%$ compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we are able to maintain or exceed performance on $14$ diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
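
The token-saving idea in the abstract can be illustrated with simple arithmetic. The sketch below is not the authors' implementation: the function name `token_budget`, the grouping of frames into fixed-size groups with one keyframe each, and every number in the usage example are hypothetical, chosen only to show how fully encoding a few keyframes and using a handful of tokens for the remaining frames shrinks the context.

```python
def token_budget(num_frames, gop_size, tokens_per_keyframe, tokens_per_codec_frame):
    """Compare dense per-frame encoding with a keyframe + codec-primitive mix.

    Assumption (hypothetical, not from the paper): one frame per group of
    `gop_size` frames is fully encoded as a keyframe, and every other frame is
    represented by a small, fixed number of codec-primitive tokens.
    """
    if gop_size < 1:
        raise ValueError("gop_size must be at least 1")

    # Ceiling division: a partial trailing group still needs its own keyframe.
    num_keyframes = -(-num_frames // gop_size)
    num_codec_frames = num_frames - num_keyframes

    # Baseline: every frame goes through the full image encoder.
    dense_tokens = num_frames * tokens_per_keyframe
    # Mixed: full tokens for keyframes only, few tokens for the rest.
    mixed_tokens = (num_keyframes * tokens_per_keyframe
                    + num_codec_frames * tokens_per_codec_frame)

    saving = 1 - mixed_tokens / dense_tokens
    return dense_tokens, mixed_tokens, saving


# Illustrative values only; none of these numbers come from the paper.
dense, mixed, saving = token_budget(
    num_frames=64, gop_size=8, tokens_per_keyframe=200, tokens_per_codec_frame=8
)
print(f"dense={dense} mixed={mixed} saving={saving:.0%}")
```

With these made-up settings, 8 of the 64 frames are keyframes, so the mixed scheme uses 2048 tokens instead of 12800. The actual reductions reported in the abstract depend on the keyframe and codec primitive densities the authors choose.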

Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | ActivityNet-QA (test) | Accuracy | 58.8 | 275 |
| Long Video Understanding | LongVideoBench (val) | Accuracy | 56.9 | 139 |
| 3D Question Answering | ScanQA (val) | CIDEr | 95.1 | 133 |
| Video Question Answering | NEXT-QA | Overall Accuracy | 81.8 | 105 |
| Video Question Answering | VideoMME | Accuracy | 60.1 | 99 |
| Video Understanding | MVBench (test) | Accuracy | 61.6 | 97 |
| 3D Question Answering | SQA3D (test) | EM@1 | 57.1 | 55 |
| Video Question Answering | VideoMME wo sub | Accuracy | 61.7 | 51 |
| Video Understanding | TempCompass MCQ (test) | Accuracy | 68.4 | 33 |
| Video Question Answering | NextQA MC | Score | 81.8 | 24 |
Showing 10 of 18 rows
