CoPE-VideoLM: Leveraging Codec Primitives For Efficient Video Language Modeling

About

Video Language Models (VideoLMs) enable AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods rely on keyframe sampling, which often misses both macro-level events and micro-level details because of its sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. We address these limitations by leveraging video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach, CoPE-VideoLM, reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal and motion reasoning, long-form understanding, and spatial scene understanding.
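
The token saving described above can be illustrated with back-of-the-envelope arithmetic. The sketch below is not taken from the paper: it assumes a hypothetical layout in which one frame per group of pictures is a keyframe encoded as a full image, and every other frame is summarized by a small number of tokens derived from its motion vectors and residuals. The function name, the group size, and the per-frame token counts are all illustrative assumptions.

```python
def token_count(num_frames, gop_size, tokens_per_keyframe, tokens_per_delta_frame):
    """Count tokens for a clip split into groups of pictures (GOPs).

    The first frame of each GOP is treated as a keyframe encoded in full;
    the remaining frames are assumed to be represented by a few tokens
    built from codec primitives. All parameters are hypothetical.
    """
    if num_frames < 0 or gop_size < 1:
        raise ValueError("num_frames must be >= 0 and gop_size must be >= 1")
    num_keyframes = -(-num_frames // gop_size)  # ceiling division
    num_delta_frames = num_frames - num_keyframes
    return (num_keyframes * tokens_per_keyframe
            + num_delta_frames * tokens_per_delta_frame)


# Illustrative numbers only; they are not the paper's configuration.
dense = token_count(num_frames=64, gop_size=1,
                    tokens_per_keyframe=196, tokens_per_delta_frame=0)
mixed = token_count(num_frames=64, gop_size=8,
                    tokens_per_keyframe=196, tokens_per_delta_frame=8)
saving = 1 - mixed / dense

print(f"dense: {dense} tokens, mixed: {mixed} tokens, saving: {saving:.1%}")
```

Under these assumed values, the mixed layout uses 2016 tokens where full-image encoding of every frame uses 12544. Changing `gop_size` or `tokens_per_delta_frame` corresponds to the trade-off the abstract describes between keyframe density and codec primitive density.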

Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu • 2026

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	Accuracy 61.9	563
3D Question Answering	ScanQA (val)	CIDEr 95.1	290
Video Question Answering	ActivityNet-QA (test)	Accuracy 58.8	288
Video Question Answering	VideoMME	Accuracy 60.1	251
Long Video Understanding	LongVideoBench (val)	Accuracy 56.9	225
Long Video Understanding	LVBench	Accuracy 46.4	218
Video Understanding	MVBench (test)	Accuracy 61.6	190
Temporal Video Understanding	TempCompass	Accuracy 68.9	141
3D Question Answering	SQA3D (test)	EM@1 57.1	131
Video Question Answering	NEXT-QA	Overall Accuracy 81.8	105

Showing 10 of 26 rows
