VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
About
This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To develop fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. Trained on this data, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding.
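The R1@0.7 metric cited above is standard for temporal grounding: the fraction of queries whose top-1 predicted time span overlaps the ground-truth span with IoU ≥ 0.7. As a minimal illustrative sketch (not the authors' evaluation code), it can be computed as:

```python
def temporal_iou(pred, gt):
    """IoU between two [start, end] intervals (seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, thresh=0.7):
    """R1@thresh: fraction of queries whose top-1 prediction
    reaches IoU >= thresh with the ground-truth span."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(preds)
```

The same function with `thresh=0.5` gives the R@1 IoU=0.5 number reported for Charades-STA below.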
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Video Object Segmentation | Ref-YouTube-VOS (val) | J&F Score | 71.3 | 200 |
| Referring Video Object Segmentation | MeViS (val) | J&F Score | 0.517 | 122 |
| Video Grounding | Charades-STA | R@1 IoU=0.5 | 70 | 113 |
| Dense Video Captioning | YouCook2 | SODA_c | 7.3 | 29 |
| Video Highlight Detection | QVHighlights | mAP | 0.275 | 29 |
| Referring Video Object Segmentation | ReVOS (val) | J&F Score | 63.1 | 8 |
| Referring Expression Segmentation | RefCOCO (val test) | cIoU | 83.4 | 6 |
| Referring Expression Segmentation | RefCOCO+ (val test) | cIoU | 79.2 | 6 |
| Referring Expression Segmentation | RefCOCOg (val test) | cIoU | 81.4 | 6 |
| Grounded Conversation Generation | Grand-f | AP50 | 34.1 | 4 |
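The J&F score used in the segmentation rows averages region similarity J (the Jaccard index, i.e. mask IoU) and boundary accuracy F (a contour F-measure). The J component can be sketched as follows; this is an illustrative computation under that standard definition, not the benchmarks' official evaluation code, and the boundary F term is omitted:

```python
import numpy as np

def region_jaccard(pred_mask, gt_mask):
    """J: intersection-over-union of two binary segmentation masks.
    Returns 1.0 when both masks are empty (nothing to segment)."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 1.0
```

For video, J is typically averaged over the annotated frames of each object before averaging over objects.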