
VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding

About

This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To develop fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. Trained on this data, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.

Jiapeng Shi, Junke Wang, Zuyao You, Bo He, Zuxuan Wu • 2026
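
The 48.3 R1@0.7 figure quoted in the abstract is Recall@1 at a temporal IoU threshold of 0.7: the fraction of language queries whose single top-ranked segment overlaps the ground-truth moment with IoU of at least 0.7. Below is a minimal Python sketch of that metric, assuming the conventional Charades-STA-style definition in which the "union" is the span from the earliest start to the latest end; the segments are hypothetical examples, not outputs of VideoLoom.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between (start, end) segments, in seconds.

    Follows the common moment-localization convention: the union is the
    hull from the earliest start to the latest end, and the intersection
    is clamped at zero for disjoint segments.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, threshold=0.7):
    """R1@threshold: fraction of queries whose top-1 segment clears the IoU bar."""
    hits = [temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts)]
    return sum(hits) / len(hits)

# Hypothetical top-1 predictions and ground truths for three queries.
preds = [(2.0, 9.5), (0.0, 4.0), (10.0, 15.0)]
gts   = [(2.5, 10.0), (1.0, 8.0), (10.5, 14.0)]
print(f"R1@0.7 = {100 * recall_at_1(preds, gts):.1f}")  # -> R1@0.7 = 66.7
```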

Related benchmarks

Task | Dataset | Metric | Result | Rank
Referring Video Object Segmentation | Ref-YouTube-VOS (val) | J&F Score | 71.3 | 200
Referring Video Object Segmentation | MeViS (val) | J&F Score | 51.7 | 122
Video Grounding | Charades-STA | R@1 IoU=0.5 | 70 | 113
Dense Video Captioning | YouCook2 | SODA_c | 7.3 | 29
Video Highlight Detection | QVHighlights | mAP | 27.5 | 29
Referring Video Object Segmentation | ReVOS (val) | J&F Score | 63.1 | 8
Referring Expression Segmentation | RefCOCO (val test) | cIoU | 83.4 | 6
Referring Expression Segmentation | RefCOCO+ (val test) | cIoU | 79.2 | 6
Referring Expression Segmentation | RefCOCOg (val test) | cIoU | 81.4 | 6
Grounded Conversation Generation | Grand-f | AP50 | 34.1 | 4

Showing 10 of 11 rows.
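
Two of the mask metrics above are easy to conflate. J&F (the video object segmentation rows) averages region similarity J, i.e., per-mask IoU, with a boundary F-measure, whereas cIoU (the RefCOCO rows) pools intersections and unions over the entire split before dividing, so large objects carry more weight. Below is a minimal numpy sketch of J and cIoU, with the boundary F term omitted and toy hypothetical masks in place of real predictions:

```python
import numpy as np

def mask_iou(pred, gt):
    """Region similarity J: intersection-over-union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def mean_j(preds, gts):
    """Mean J: average of per-mask IoUs (J&F further averages in a boundary F term)."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(preds, gts)]))

def ciou(preds, gts):
    """cIoU: cumulative IoU -- sum intersections and unions across the
    whole split first, then divide once."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return inter / union if union > 0 else 1.0

# Hypothetical 2x2 toy masks for two samples.
preds = [np.array([[1, 1], [0, 0]], bool), np.array([[1, 0], [0, 0]], bool)]
gts   = [np.array([[1, 0], [0, 0]], bool), np.array([[1, 1], [1, 0]], bool)]
print(f"mean J = {100 * mean_j(preds, gts):.1f}, cIoU = {100 * ciou(preds, gts):.1f}")
# -> mean J = 41.7, cIoU = 40.0
```

The gap between the two outputs shows why the metrics are not interchangeable: the second sample's larger union dominates cIoU but counts the same as the first sample under mean J.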

Other info

GitHub
