V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models

About

Video large language models (VideoLLMs) show strong capability in video understanding, yet long-context inference is still dominated by massive redundant visual tokens in the prefill stage. We revisit token compression for VideoLLMs under a tight budget and identify a key bottleneck, namely insufficient spatio-temporal information coverage. Existing methods often introduce discontinuous coverage through coarse per-frame allocation or scene segmentation, and token merging can further misalign spatio-temporal coordinates under MRoPE-style discrete (t,h,w) bindings. To address these issues, we propose V-CAST (Video Curvature-Aware Spatio-Temporal Pruning), a training-free, plug-and-play pruning policy for long-context video inference. V-CAST casts token compression as a trajectory approximation problem and introduces a curvature-guided temporal allocation module that routes per-frame token budgets to semantic turns and event boundaries. It further adopts a dual-anchor spatial selection mechanism that preserves high-entropy visual evidence without attention intervention, while keeping retained tokens at their original coordinates to maintain positional alignment. Extensive experiments across multiple VideoLLMs of different architectures and scales demonstrate that V-CAST achieves 98.6% of the original performance, outperforms the second-best method by +1.1% on average, and reduces peak memory and total latency to 86.7% and 86.4% of vanilla Qwen3-VL-8B-Instruct.
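The abstract describes two steps: routing per-frame token budgets toward high-curvature points of the video's feature trajectory, and keeping retained tokens at their original (t,h,w) coordinates. The sketch below illustrates that idea only; it is not the authors' implementation. Everything in it is an assumption: the function names (`frame_curvature`, `allocate_budget`, `prune_tokens`), the use of the turning angle between consecutive frame-feature differences as a curvature proxy, and the distance-from-frame-mean score, which is a simple stand-in for the paper's dual-anchor spatial selection (whose details are not given here).

```python
import numpy as np

def frame_curvature(frame_feats: np.ndarray) -> np.ndarray:
    """Hypothetical curvature proxy: turning angle of the (T, D) frame-feature trajectory."""
    diffs = np.diff(frame_feats, axis=0)                       # (T-1, D) step vectors
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    unit = diffs / np.maximum(norms, 1e-8)                     # avoid division by zero
    cos = np.sum(unit[:-1] * unit[1:], axis=1)                 # cosine between consecutive steps
    curv = np.zeros(frame_feats.shape[0])
    curv[1:-1] = np.arccos(np.clip(cos, -1.0, 1.0))            # endpoints keep curvature 0
    return curv

def allocate_budget(curv: np.ndarray, total_budget: int, min_per_frame: int = 1) -> np.ndarray:
    """Split total_budget across frames in proportion to curvature (assumed rule)."""
    num_frames = len(curv)
    spare = total_budget - min_per_frame * num_frames
    if spare < 0:
        raise ValueError("total_budget is too small for min_per_frame")
    weights = curv + 1e-6                                      # keep flat videos well defined
    raw = weights / weights.sum() * spare
    extra = np.floor(raw).astype(int)
    remainder = int(spare - extra.sum())
    order = np.argsort(-(raw - extra))                         # largest fractional parts first
    extra[order[:remainder]] += 1
    return extra + min_per_frame

def prune_tokens(tokens: np.ndarray, budgets: np.ndarray):
    """Keep budgets[t] tokens per frame; return features and their original (t, h, w)."""
    T, H, W, D = tokens.shape
    kept_feats, kept_pos = [], []
    for t in range(T):
        flat = tokens[t].reshape(H * W, D)
        # Stand-in saliency score (NOT the paper's dual-anchor mechanism).
        score = np.linalg.norm(flat - flat.mean(axis=0), axis=1)
        k = min(int(budgets[t]), H * W)
        idx = np.sort(np.argsort(-score)[:k])                  # preserve raster order
        h, w = np.divmod(idx, W)
        kept_feats.append(flat[idx])                           # pruning only, no merging
        kept_pos.append(np.stack([np.full(k, t), h, w], axis=1))
    return np.concatenate(kept_feats), np.concatenate(kept_pos)
```

Because tokens are selected rather than merged, each surviving token carries the same discrete (t,h,w) index it had before compression, which is the property the abstract ties to MRoPE-style positional alignment. Note that this sketch silently caps a frame's budget at its token count, so a real allocator would need to redistribute any unused budget.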

Xinying Lin, Xuyang Liu, Yiyu Wang, Teng Ma, Wenqi Ren • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Understanding | MVBench | -- | 425 |
| Video Understanding | VideoMME | Score (Long): 59.8 | 248 |
| Long Video Understanding | LongVideoBench | Score: 61.6 | 248 |
| Long Video Understanding | MLVU | Score: 64.7 | 154 |
| Long Video Understanding | LongVideo-Bench | Score: 61.2 | 89 |
| Long Video Understanding | MLVU (test) | -- | 60 |
| Video Understanding | Aggregate MVBench, LongVideo Bench, MLVU, VideoMME | Average Score: 68.1 | 59 |
| Multi-discipline Long Video Understanding | MLVU | Score: 67.1 | 44 |
| Multi-modal Video Evaluation | VideoMME | -- | 42 |
| Video Multi-modal Evaluation | VideoMME | Overall Score: 66 | 13 |
Showing 10 of 14 rows
