CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
About
Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize whether the retained cache is representative of the accumulated history. We propose to view KV-cache compression as a coreset selection problem: rather than scoring tokens independently for retention, we select a small subset that covers the geometry of the accumulated visual cache. Our method operates on a joint KV representation and introduces a bicriteria objective that balances coverage in key and value spaces, preserving both retrieval structure and output-relevant information. To encourage a more diverse retained subset, we further introduce an orthogonality-driven diversity criterion that favors candidates contributing new directions beyond the current selection, and we connect this criterion to log-determinant subset selection. Across four open-source VLMs and five long-video and streaming-video benchmarks, our method improves over heuristic streaming compression baselines under a fixed cache budget. These results indicate that representative coreset selection is a more effective principle than token-wise pruning for memory-constrained streaming video understanding.
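The two selection ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a greedy farthest-point (k-center) rule for coverage, a hypothetical weight `alpha` that trades off key-space against value-space distance, and a pivoted Gram-Schmidt rule for diversity. The function names `kcenter_select` and `orthogonal_select` are invented for illustration. The diversity rule picks, at each step, the candidate with the largest residual after projecting out the span of the tokens already selected; because the determinant of a Gram matrix equals the product of these squared residual norms, this greedy step also maximizes the log-determinant gain under a linear kernel.

```python
from math import sqrt


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kcenter_select(keys, values, budget, alpha=0.5):
    """Greedy farthest-point selection under a combined key/value distance.

    alpha is a hypothetical trade-off weight (an assumption of this sketch):
    alpha=1 covers key space only, alpha=0 covers value space only.
    """
    n = len(keys)
    if n == 0 or budget <= 0:
        return []

    def dist(i, j):
        return (alpha * sq_dist(keys[i], keys[j])
                + (1 - alpha) * sq_dist(values[i], values[j]))

    selected = [0]  # arbitrary seed token; other seeding rules are possible
    # min_dist[i]: distance from token i to its nearest selected token
    min_dist = [dist(i, 0) for i in range(n)]
    while len(selected) < min(budget, n):
        nxt = max(range(n), key=lambda i: min_dist[i])
        if min_dist[nxt] == 0:
            break  # every remaining token duplicates a selected one
        selected.append(nxt)
        min_dist = [min(min_dist[i], dist(i, nxt)) for i in range(n)]
    return sorted(selected)


def orthogonal_select(vectors, budget, eps=1e-12):
    """Greedy selection by largest residual norm (pivoted Gram-Schmidt)."""
    residuals = [list(v) for v in vectors]
    selected = []
    for _ in range(min(budget, len(vectors))):
        norms = [sum(x * x for x in r) for r in residuals]
        best = max(range(len(residuals)), key=lambda i: norms[i])
        if norms[best] <= eps:
            break  # no candidate adds a new direction
        selected.append(best)
        q = [x / sqrt(norms[best]) for x in residuals[best]]
        # Remove the newly covered direction from every residual.
        for r in residuals:
            coef = sum(a * b for a, b in zip(r, q))
            for d in range(len(r)):
                r[d] -= coef * q[d]
    return selected


if __name__ == "__main__":
    keys = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [0.0, 10.0]]
    values = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [0.0, 10.0]]
    print(kcenter_select(keys, values, budget=3))
    print(orthogonal_select([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]], budget=2))
```

To apply the diversity rule to a joint KV representation, one option (again an assumption) is to pass the concatenation of each token's key and value vectors to `orthogonal_select`. A practical implementation would operate on batched tensors per attention head rather than Python lists.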
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long Video Understanding | MLVU | -- | -- | 205 |
| Long Video Understanding | Video-MME (full) | Overall Performance | 64.84 | 51 |
| Offline Video Understanding | VideoMME v1 (test) | Accuracy | 65.7 | 27 |
| Offline Video Understanding | MLVU v1 (test) | Accuracy | 71.5 | 26 |
| Offline Video Understanding | EgoSchema v1 (test) | Accuracy | 68.4 | 22 |
| Multi-task Visual Reasoning | OVO-Bench | Backward Avg | 50.46 | 3 |