EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT
About
Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, so it requires inferring hidden intentions and recognizing fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at reasoning about visible events but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought (CoT) supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we apply supervised fine-tuning (SFT) on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks and achieves substantial improvements on fine-grained spatio-temporal localization tasks. Full code and data are released at https://github.com/InternRobotics/EgoThinker.
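
The abstract does not say how spatio-temporal localization is scored during RFT or evaluation. As a minimal sketch, assuming the common convention of intersection-over-union (IoU) between a predicted and a reference span or box, the overlap could be computed as follows. The function names `temporal_iou` and `box_iou` are hypothetical and are not taken from the EgoThinker code base.

```python
def temporal_iou(pred, gt):
    """IoU between two time spans given as (start, end), e.g. in seconds.

    Assumption: spans are closed intervals with start <= end.
    """
    p_start, p_end = pred
    g_start, g_end = gt
    # Overlap length is zero when the spans are disjoint.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b):
    """IoU between two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlapping rectangle, clamped at zero.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # A predicted span of 2-6 s against a reference span of 4-8 s.
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
    # Two 2x2 boxes offset by one unit in each direction.
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A score of this kind is bounded in [0, 1] and rises smoothly as the prediction approaches the reference, which is why it is often used both as an evaluation metric and as a reward signal; whether EgoThinker uses it in exactly this form is not stated here.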
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Pixel Grounding | EgoHOS Out-of-distribution (test) | Left Hand IoU | 29.33 | 18 |
| Egocentric Interaction Analysis | Ego-IRGBench (test) | METEOR | 0.182 | 15 |
| Egocentric Interaction Answering | Ego-IRGBench (val) | METEOR | 0.495 | 15 |
| Egocentric Interaction Analysis | Ego-IRGBench (val) | METEOR | 0.184 | 15 |
| Egocentric Interaction Answering | Ego-IRGBench (test) | METEOR | 49.1 | 15 |
| Egocentric Interaction Grounding | Ego-IRGBench (val) | cIoU | 12.87 | 15 |
| Egocentric Interaction Grounding | Ego-IRGBench (test) | cIoU | 12.33 | 15 |
| Egocentric Video Reasoning | EgoSchema | Accuracy | 68.2 | 5 |
| Egocentric Video Reasoning | EgoPlan | Accuracy | 34.1 | 5 |
| Egocentric Video Reasoning | EgoBlind | Accuracy | 43.8 | 5 |
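
The grounding rows report IoU and cIoU without defining them. The sketch below assumes that IoU is the usual per-sample mask overlap and that cIoU means cumulative IoU (total intersection divided by total union over the whole split), which is a common reading in segmentation benchmarks but is not confirmed by this page. The helper names and the toy masks are hypothetical.

```python
def mask_counts(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks.

    Masks are lists of rows holding 0/1 values and must share a shape.
    """
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def mean_iou(pairs):
    """Average of per-sample IoU over a non-empty list of (pred, gt) pairs."""
    scores = []
    for pred, gt in pairs:
        inter, union = mask_counts(pred, gt)
        # Assumption: two empty masks count as a perfect match.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


def cumulative_iou(pairs):
    """Total intersection over total union across all (pred, gt) pairs."""
    total_inter = total_union = 0
    for pred, gt in pairs:
        inter, union = mask_counts(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


if __name__ == "__main__":
    # Toy data: one small, half-correct mask and one large, perfect mask.
    pairs = [
        ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),
        ([[1, 1], [1, 1]], [[1, 1], [1, 1]]),
    ]
    print(mean_iou(pairs))
    print(cumulative_iou(pairs))
```

Under these assumptions the two scores can differ on the same predictions, because cumulative IoU weights each sample by its pixel area while the per-sample mean weights every sample equally.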