EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning

About

Egocentric video understanding requires procedural reasoning under partial observability and continuously shifting viewpoints. Current multimodal large language models (MLLMs) struggle with this setting, often generating plausible but visually inconsistent or weakly grounded responses. We introduce $\textbf{EgoVITA}$, a framework that decomposes egocentric video reasoning into a structured $\textit{plan-then-verify}$ process. The model first generates an $\textbf{egocentric plan}$: a causal sequence of anticipated actions from a first-person perspective. This plan is then evaluated by an $\textbf{exocentric verification}$ stage that validates spatiotemporal and logical consistency from a third-person viewpoint. This decomposition enables cross-perspective feedback without requiring paired ego-exo supervision. To train this reasoning process, we adopt Group Relative Policy Optimization (GRPO) with two dense reward signals: one that aligns intermediate plan steps with future visual states and another that reinforces consistent third-person verification. EgoVITA achieves state-of-the-art performance on egocentric reasoning benchmarks, outperforming Qwen2.5-VL-7B by $\mathbf{+7.7}$ on EgoBlind and $\mathbf{+4.4}$ on EgoOrient, while maintaining strong generalization on exocentric video tasks with only $47k$ training samples.
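
To make the training signal above concrete, here is a minimal sketch of how a group-relative advantage could be computed from two reward terms. The normalization (each sampled response's reward minus the group mean, divided by the group standard deviation) is the standard GRPO formulation. Everything else is an assumption made for illustration: the function names, the equal weights `w_plan` and `w_verify`, the weighted-sum combination, and the example reward values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def combined_reward(plan_reward, verify_reward, w_plan=0.5, w_verify=0.5):
    """Combine the two reward terms into one scalar.

    Assumption: a weighted sum with equal weights. The abstract does not
    say how the plan-alignment and verification rewards are combined.
    """
    return w_plan * plan_reward + w_verify * verify_reward


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization within one group of samples.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is scored relative to the other responses for the same prompt.
    Population standard deviation is used here as a simplifying choice.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (plan_reward, verify_reward) pairs for four sampled
# responses to the same video question.
group = [(0.9, 0.8), (0.4, 0.6), (0.2, 0.1), (0.7, 0.9)]
rewards = [combined_reward(p, v) for p, v in group]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Responses whose combined reward is above the group mean receive a positive advantage and are reinforced; those below the mean receive a negative one. No separate value network is needed, because the group itself serves as the baseline.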

Yogesh Kulkarni, Pooyan Fazli • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | Video-MME | Overall Score | 72.2 | 92 |
| Egocentric Video Understanding | EgoBlind | Score | 51.9 | 13 |
| Egocentric Video Understanding | EgoPlan | Score | 35.9 | 13 |
| Egocentric Video Understanding | EgoThink | Score | 63.9 | 13 |
| Egocentric Video Understanding | EOC-Bench | Score | 48.6 | 13 |
| Egocentric Video Understanding | EgoOrient | Primary Score | 63.1 | 13 |
| Exocentric Video Understanding | MVBench | Score | 73.8 | 13 |
| Exocentric Video Understanding | LVBench | Score | 56.5 | 13 |
| Exocentric Video Understanding | TOMATO | Score | 34.5 | 13 |
| Basic Exocentric Understanding | MMMU | Accuracy | 51.3 | 5 |
Showing 10 of 17 rows
