
What Happens When: Learning Temporal Orders of Events in Videos

About

Video Large Multimodal Models (VLMMs) have shown impressive performance in video understanding, yet their ability to accurately capture the temporal order of multiple events remains underexplored. Through comprehensive experiments, we observe that models still perform very well on existing benchmarks even when the video frames are scrambled. This implies that VLMMs may not rely on accurate sequential processing of visual events, but may instead depend on prior knowledge of typical scenarios to answer the question. To benchmark the temporal understanding capabilities of VLMMs, we propose VECTOR, which is designed to explicitly assess a model's ability to identify the temporal order of events. On this benchmark, we observe that various VLMMs often fail to understand the order of events. To address this, we propose MECOT (Multi-Event instruction fine-tuning with Chain-of-Thought), which (1) trains models on detailed, event-by-event video descriptions and (2) uses chain-of-thought prompts at inference to enhance temporal awareness. MECOT outperforms prior art on VECTOR and also improves performance on existing video benchmarks, indicating the benefit of stronger temporal understanding. We release our code, model, and datasets.
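The frame-scrambling observation in the abstract can be illustrated with a small probe: score a model on the original frame order, then again on shuffled frames, and compare. The sketch below is only an illustration under stated assumptions, not the authors' evaluation code. The `Model` callable interface, the `(frames, question, answer)` sample layout, and the toy `order_blind_model` are all hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface (assumption): a model maps (frames, question) to an answer.
Model = Callable[[Sequence[str], str], str]
Sample = Tuple[List[str], str, str]  # (frames, question, expected answer)


def scramble_frames(frames: Sequence[str], seed: int = 0) -> List[str]:
    """Return a shuffled copy of the frames; the input is left untouched."""
    rng = random.Random(seed)
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled


def accuracy(model: Model, samples: Sequence[Sample],
             scramble: bool = False, seed: int = 0) -> float:
    """Fraction of samples answered correctly, optionally on shuffled frames."""
    correct = 0
    for i, (frames, question, answer) in enumerate(samples):
        if scramble:
            frames = scramble_frames(frames, seed=seed + i)
        correct += int(model(frames, question) == answer)
    return correct / len(samples)


def order_blind_model(frames: Sequence[str], question: str) -> str:
    """Toy stand-in that answers from frame content alone and ignores order."""
    return "cooking" if "pan" in frames else "soccer"


# Toy data: strings stand in for real video frames.
samples = [
    (["pan", "egg", "plate"], "What is the person doing?", "cooking"),
    (["ball", "goal", "net"], "What is the person doing?", "soccer"),
]

ordered_acc = accuracy(order_blind_model, samples)
scrambled_acc = accuracy(order_blind_model, samples, scramble=True)
```

A model whose accuracy is unchanged between the two runs, like the toy one here, is not using frame order to answer, which is the behavior the abstract reports on existing benchmarks.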

Daechul Ahn, Yura Choi, Hyeonbeom Choi, Seongwon Cho, San Kim, Jonghyun Choi • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Understanding | MVBench | Accuracy 57.56 | 247 |
| Video Understanding | MLVU | -- | 54 |
| Temporal Video Understanding | TempCompass | -- | 52 |
| Video Understanding | EgoSchema | Accuracy 62.25 | 49 |
| Event Sequencing | VECTOR L2 (Ne=8) 1.0 (test) | EM (Exact Match) 833 | 26 |
| Event Sequencing | VECTOR L1 (Ne=4) 1.0 (test) | EM Score 55.67 | 26 |
| Video Understanding | VideoVista | Accuracy 74.23 | 21 |
| Event Sequencing (Full-sequence ordering) | VECTOR L1 | Exact Match (EM) 4.17e+3 | 13 |
| Event Sequencing (Full-sequence ordering) | VECTOR L2 | Exact Match (EM) 4.33 | 13 |
| Event Position Identification (Single event detection) | VECTOR L2 | Exact Match (EM) 71.33 | 6 |
Showing 10 of 27 rows
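
The event sequencing rows above report Exact Match (EM). As a rough illustration, the sketch below assumes EM counts a prediction as correct only when the entire predicted event order equals the reference order, and reports the result as a percentage. That reading is an assumption, the function names are hypothetical, and VECTOR's official scoring may differ.

```python
from typing import List, Sequence


def exact_match(predicted: Sequence[int], reference: Sequence[int]) -> bool:
    """True only if every event sits at the same position as in the reference."""
    return list(predicted) == list(reference)


def em_score(predictions: Sequence[Sequence[int]],
             references: Sequence[Sequence[int]]) -> float:
    """Percentage of samples whose full predicted order matches exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Four-event toy example: the second prediction swaps the first two events,
# so it earns no credit even though two of its positions are correct.
references: List[List[int]] = [[0, 1, 2, 3], [0, 1, 2, 3]]
predictions: List[List[int]] = [[0, 1, 2, 3], [1, 0, 2, 3]]
score = em_score(predictions, references)
```

Under this definition a single misplaced event zeroes out the sample, so full-sequence ordering scores tend to drop quickly as the number of events grows.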
