
Seeing the Arrow of Time in Large Multimodal Models

About

The Arrow of Time (AoT), time's irreversible flow shaping physical events, is fundamental to video comprehension, yet remains a significant challenge for modern large multimodal models (LMMs). Current LMMs struggle to perceive and utilize temporal directionality in video when responding to language queries, obstructing deeper temporal understanding. We tackle this deficiency by first providing a critical analysis of existing benchmarks and models. We then introduce ArrowRL, a reinforcement learning (RL)-based training strategy with an innovative reverse reward that instills AoT awareness by encouraging divergent video interpretations between forward and reversed visual frames. For rigorous evaluation, we additionally develop AoTBench, a new multi-faceted benchmark probing temporally challenging questions. Experiments show ArrowRL greatly advances temporal perception: it not only achieves substantial improvements on our challenging AoTBench but also demonstrably boosts performance on standard video question answering (VQA) benchmarks (with peak accuracy gains reaching over 20% and 10% respectively). This validates ArrowRL's effectiveness and highlights the critical need for dedicated AoT understanding in LMMs.

Zihui Xue, Mi Luo, Kristen Grauman • 2025
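
The reverse reward mentioned in the abstract can be illustrated with a rough sketch. The abstract states only that the reward encourages divergent interpretations of forward and reversed frames; it gives no formula. The function names below and the use of token-level Jaccard dissimilarity are assumptions made for illustration, not the authors' formulation.

```python
def token_set(text: str) -> set[str]:
    """Lowercase, whitespace-split tokens of a model response."""
    return set(text.lower().split())


def reverse_reward(forward_response: str, reversed_response: str) -> float:
    """Hypothetical reverse reward in [0, 1].

    ASSUMPTION: divergence is measured as 1 - Jaccard similarity of the
    token sets. The source does not specify the actual measure.
    Higher values mean the response to the forward video differs more
    from the response to the time-reversed video.
    """
    a = token_set(forward_response)
    b = token_set(reversed_response)
    if not a and not b:
        # Two empty responses are identical, so no divergence is rewarded.
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


if __name__ == "__main__":
    # The two answers share "the" and "door" out of four distinct tokens.
    print(reverse_reward("the door opens", "the door closes"))
```

In an RL training loop, a term like this would presumably be combined with a task-correctness reward, so that a model is not rewarded merely for producing different text. That combination is likewise an assumption here.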

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Hallucination Evaluation | VideoHallucer | ORH | 60.5 | 25 |
| Temporal Understanding | TempCompass, TVBench | TempCompass Score | 0.726 | 17 |
| Hallucination Examination | VidHalluc, VideoHallucer, EventHallusion | VidHalluc Score | 73.2 | 17 |
| Conventional Video Understanding | VideoMMe, MVBench | VideoMMe Score | 49.6 | 17 |
| Video-to-Text Retrieval | Something-Something CiA-Retrieval v2 | R@1 (Chiral) | 66.4 | 16 |
| Text-to-Video Retrieval | Something-Something CiA-Retrieval v2 | mAP (Chiral) | 67.5 | 16 |
| Hallucination Examination | EventHallusion | Average Score | 68.95 | 15 |
| Hallucination Examination | VidHalluc | BQA | 76.14 | 15 |
| Video-to-Text Retrieval | ReversedInTime | Binary Accuracy | 69.6 | 11 |
| Text-to-Video Retrieval | ReversedInTime | Binary Accuracy | 57.1 | 11 |