Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning

About

We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., spanning days and weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one per step, to iteratively and collaboratively answer sub-questions that tackle tasks such as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm involving supervised fine-tuning (SFT) of a pretrained language model using CoTT data, followed by RL, to enable our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 Agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning by our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from a few hours to a week.

Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu • 2025
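
The abstract describes CoTT as a loop in which the agent invokes one tool per step and uses each result to decide the next step. The sketch below is only an illustration of that general pattern, not the authors' implementation: the names `Action`, `Step`, `run_cott`, `retrieve_by_time`, `describe_clip`, and `scripted_policy` are all hypothetical, the two toy tools stand in for whatever tools Ego-R1 actually uses, and a hand-scripted policy takes the place of the RL-trained agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# A tool maps a textual query to a textual observation (assumed interface).
Tool = Callable[[str], str]


@dataclass
class Action:
    thought: str          # the agent's reasoning for this step
    tool: Optional[str]   # name of the tool to call; None means "stop and answer"
    argument: str         # tool query, or the final answer when tool is None


@dataclass
class Step:
    thought: str
    tool: str
    query: str
    observation: str


Policy = Callable[[str, List[Step]], Action]


def run_cott(question: str, policy: Policy, tools: Dict[str, Tool],
             max_steps: int = 8) -> Tuple[Optional[str], List[Step]]:
    """Run a one-tool-per-step loop until the policy answers or the budget ends."""
    history: List[Step] = []
    for _ in range(max_steps):
        action = policy(question, history)
        if action.tool is None:
            return action.argument, history
        if action.tool in tools:
            observation = tools[action.tool](action.argument)
        else:
            observation = f"unknown tool: {action.tool}"
        history.append(Step(action.thought, action.tool, action.argument, observation))
    return None, history  # step budget exhausted without a final answer


# Toy data and tools, invented purely for illustration.
LOG = {"day3 18:00": "clip_42"}
CLIPS = {"clip_42": "the wearer puts the keys on the kitchen shelf"}


def retrieve_by_time(query: str) -> str:
    return LOG.get(query, "no clip found")


def describe_clip(clip_id: str) -> str:
    return CLIPS.get(clip_id, "no description")


TOOLS: Dict[str, Tool] = {
    "retrieve_by_time": retrieve_by_time,
    "describe_clip": describe_clip,
}


def scripted_policy(question: str, history: List[Step]) -> Action:
    """Stand-in for a learned agent: two fixed tool calls, then an answer."""
    if not history:
        return Action("Find the clip near the relevant time.",
                      "retrieve_by_time", "day3 18:00")
    if len(history) == 1:
        return Action("Inspect the retrieved clip.",
                      "describe_clip", history[-1].observation)
    return Action("The last observation answers the question.",
                  None, history[-1].observation)


answer, trace = run_cott("Where did I leave my keys on day 3?",
                         scripted_policy, TOOLS)
```

Here `trace` records each thought, tool call, and observation in order, which is the kind of step-by-step record a CoTT-style dataset could store for supervised fine-tuning. In a real system the policy would be a language model, and the `max_steps` budget guards against loops that never produce an answer.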

Related benchmarks

Task | Dataset | Result | Rank
Video Understanding | VideoMME | Score (Long): 64.9 | 248
Video Question Answering | Video-MME Long Duration 1.0 | -- | 34
Video Understanding | VideoMMMU | -- | 32
Long Video Understanding | Video MME w/o sub (long) | Accuracy: 64.9 | 30
Video Understanding | WorldSense | Score: 42 | 25
Video Question Answering | MA-EgoQA | SI Score: 22.34 | 23
Long Video Understanding | EgoSchema (val) | Accuracy: 68.2 | 16
Multiple-choice Question Answering | EgoLifeQA | Average Score: 36 | 13
Video Understanding | PerceptionTest | Overall Score: 65.8 | 5