
EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

About

Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at reasoning about visible events but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought (CoT) supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we apply supervised fine-tuning (SFT) on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks while achieving substantial improvements on fine-grained spatio-temporal localization tasks. Full code and data are released at https://github.com/InternRobotics/EgoThinker.

Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Yu Qiao, Jiangmiao Pang • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Pixel Grounding | EgoHOS Out-of-distribution (test) | Left Hand IoU | 29.33 | 18
Egocentric Interaction Analysis | Ego-IRGBench (test) | METEOR | 0.182 | 15
Egocentric Interaction Answering | Ego-IRGBench (val) | METEOR | 0.495 | 15
Egocentric Interaction Analysis | Ego-IRGBench (val) | METEOR | 0.184 | 15
Egocentric Interaction Answering | Ego-IRGBench (test) | METEOR | 49.1 | 15
Egocentric Interaction Grounding | Ego-IRGBench (val) | cIoU | 12.87 | 15
Egocentric Interaction Grounding | Ego-IRGBench (test) | cIoU | 12.33 | 15
Egocentric Video Reasoning | EgoSchema | Accuracy | 68.2 | 5
Egocentric Video Reasoning | EgoPlan | Accuracy | 34.1 | 5
Egocentric Video Reasoning | EgoBlind | Accuracy | 43.8 | 5
Showing 10 of 15 rows
