CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

About

Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize whether the retained cache is representative of the accumulated history. We propose to view KV-cache compression as a coreset selection problem: rather than scoring tokens independently for retention, we select a small subset that covers the geometry of the accumulated visual cache. Our method operates in a joint KV representation and introduces a bicriteria objective that balances coverage in key and value spaces, preserving both retrieval structure and output-relevant information. To encourage a more diverse retained subset, we further introduce an orthogonality-driven diversity criterion that favors candidates contributing new directions beyond the current selection, and connect this criterion to log-determinant subset selection. Across four open-source VLMs and five long-video and streaming-video benchmarks, our method improves over heuristic streaming compression baselines under a fixed cache budget. These results highlight that representative coreset selection offers a more effective principle than token-wise pruning for memory-constrained streaming video understanding.
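
To make the coreset view concrete, here is a minimal sketch of coverage-driven selection under a fixed cache budget. It is not the authors' algorithm: it swaps in plain greedy farthest-point (k-center) selection, and the function names, the `alpha` weight, the seeding rule, and the toy vectors are all assumptions made for illustration. The joint distance is a convex combination of key-space and value-space squared distances, which is one simple way to read "balancing coverage in key and value spaces".

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_coreset(keys, values, budget, alpha=0.5):
    """Greedy farthest-point (k-center) selection over a toy KV cache.

    keys, values: lists of per-token vectors (hypothetical inputs).
    budget: maximum number of tokens to retain.
    alpha: hypothetical weight trading key-space vs. value-space coverage.
    Returns retained token indices in selection order.
    """
    n = len(keys)
    if n == 0 or budget <= 0:
        return []
    budget = min(budget, n)

    def joint(i, j):
        # Bicriteria distance: convex combination of key and value distances.
        return (alpha * sq_dist(keys[i], keys[j])
                + (1 - alpha) * sq_dist(values[i], values[j]))

    selected = [0]  # assumption: seed with the first token in the cache
    # cover[i] = joint distance from token i to its nearest retained token
    cover = [joint(i, 0) for i in range(n)]
    while len(selected) < budget:
        nxt = max(range(n), key=lambda i: cover[i])  # worst-covered token
        if cover[nxt] == 0.0:
            break  # every remaining token duplicates a retained one
        selected.append(nxt)
        cover = [min(cover[i], joint(i, nxt)) for i in range(n)]
    return selected


# Toy cache: two near-duplicate pairs plus one outlier token.
keys = [[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0], [0.0, 5.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
print(select_coreset(keys, values, budget=3))  # [0, 3, 4]
```

With a budget of three, the sketch keeps one token from each near-duplicate pair plus the outlier, whereas a recency heuristic would keep the last three tokens and drop the first cluster entirely.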

Ailar Mahdizadeh, Puria Azadi, Muchen Li, Xiangteng He, Leonid Sigal • 2026

Related benchmarks

Task                          Dataset              Result                      Rank
Long Video Understanding      MLVU                 --                          205
Long Video Understanding      Video-MME (full)     Overall Performance: 64.84  51
Offline Video Understanding   VideoMME v1 (test)   Accuracy: 65.7              27
Offline Video Understanding   MLVU v1 (test)       Accuracy: 71.5              26
Offline Video Understanding   EgoSchema v1 (test)  Accuracy: 68.4              22
Multi-task Visual Reasoning   OVO-Bench            Backward Avg: 50.46         3