
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding

About

Recent advances in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to handle temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special tokens into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose the ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-Align, we present a progressive training pipeline that aligns visual and textual features through sequential coarse-to-fine stages. Additionally, we introduce the ST-Align benchmark to evaluate spatial-temporal interleaved fine-grained understanding tasks, which include Spatial-Temporal Video Grounding (STVG), Event Localization and Captioning (ELC), and Spatial Video Grounding (SVG). LLaVA-ST achieves outstanding performance on 11 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaved multimodal understanding. Our code, data, and benchmark will be released at https://github.com/appletea233/LLaVA-ST.

Hongyu Li, Jinyu Chen, Ziyu Wei, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, Si Liu• 2025
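The abstract's Spatial-Temporal Packer compresses video features along the temporal and spatial axes in two separate point-to-region attention streams. The following is a minimal numpy sketch of that decoupling idea only; the pooling scheme, token counts, and function names here are illustrative assumptions, not the paper's actual implementation (which uses learned queries inside the model rather than random ones).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def point_to_region_attention(queries, feats):
    # Each compressed "point" token (query) attends over a full-resolution
    # "region" of features and returns its attention-weighted summary.
    # queries: [M, C], feats: [N, C] -> output: [M, C]
    scores = queries @ feats.T / np.sqrt(feats.shape[-1])
    return softmax(scores, axis=-1) @ feats

# Toy video features: T frames, each with HW spatial tokens of dim C.
T, HW, C = 8, 16, 32
rng = np.random.default_rng(0)
video = rng.standard_normal((T, HW, C))

# Temporal stream: summarize each frame, then compress T frames to Mt tokens.
Mt = 2
frame_feats = video.mean(axis=1)                  # [T, C]
t_queries = rng.standard_normal((Mt, C))          # learned in the real model
temporal_tokens = point_to_region_attention(t_queries, frame_feats)  # [Mt, C]

# Spatial stream: summarize over time, then compress HW positions to Ms tokens.
Ms = 4
spatial_feats = video.mean(axis=0)                # [HW, C]
s_queries = rng.standard_normal((Ms, C))          # learned in the real model
spatial_tokens = point_to_region_attention(s_queries, spatial_feats)  # [Ms, C]

print(temporal_tokens.shape, spatial_tokens.shape)  # (2, 32) (4, 32)
```

Because the two streams are decoupled, temporal resolution can be compressed without also discarding spatial detail in the same step, and vice versa.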

Related benchmarks

Task                             | Dataset                      | Metric         | Result | Rank
Video Question Answering        | MSRVTT-QA                    | Accuracy       | 59     | 481
Video Understanding             | MVBench                      | Accuracy       | 64.2   | 247
Temporal Video Grounding        | Charades-STA (test)          | Recall@IoU=0.5 | 44.8   | 117
Open-ended Video QA             | MSVD-QA                      | Accuracy       | 75.9   | 59
Video Question Answering        | VCG Bench                    | CI             | 3.29   | 42
Temporal Grounding              | Charades-STA                 | mIoU           | 42.4   | 33
Spatio-Temporal Video Grounding | VidSTG Declarative Sentences | m_vIoU         | 8.2    | 20
Spatio-Temporal Reasoning       | V-Star                       | Accuracy       | 55     | 14
Spatio-Temporal Video Grounding | HC-STVG v1                   | m_vIoU         | 9.4    | 11
Spatio-Temporal Video Grounding | HC-STVG v2                   | m_tIoU         | 21.6   | 9

(Showing 10 of 18 rows)

Other info

Code: https://github.com/appletea233/LLaVA-ST