VideoLucy: Deep Memory Backtracking for Long Video Understanding

About

Recent studies have shown that agent-based systems leveraging large language models (LLMs) for key information retrieval and integration have emerged as a promising approach for long video understanding. However, these systems face two major challenges. First, they typically perform modeling and reasoning on individual frames, struggling to capture the temporal context of consecutive frames. Second, to reduce the cost of dense frame-level captioning, they adopt sparse frame sampling, which risks discarding crucial information. To overcome these limitations, we propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the coarse-to-fine nature of human recollection, VideoLucy employs a hierarchical memory structure with progressive granularity. This structure explicitly defines the detail level and temporal scope of memory at different hierarchical depths. Through an agent-based iterative backtracking mechanism, VideoLucy systematically mines video-wide, question-relevant deep memories until sufficient information is gathered to provide a confident answer. This design enables effective temporal understanding of consecutive frames while preserving critical details. In addition, we introduce EgoMem, a new benchmark for long video understanding. EgoMem is designed to comprehensively evaluate a model's ability to understand complex events that unfold over time and to capture fine-grained details in extremely long videos. Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, even surpassing the latest proprietary models such as GPT-4o. Our code and dataset will be made publicly available at https://videolucy.github.io
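The iterative backtracking described above can be sketched as a loop that first answers from coarse segment-level memories and, when the answer is not confident, re-examines only the question-relevant spans at the next, finer granularity. This is a minimal illustrative sketch, not the paper's released implementation: the function names (`caption_segment`, `try_answer`), the segment-length hierarchy, and the confidence signal are all assumptions.

```python
# Hypothetical sketch of VideoLucy-style coarse-to-fine memory backtracking.
# All interfaces here (caption_segment, try_answer, the levels list) are
# illustrative assumptions, not the authors' actual API.

def backtrack_answer(video, question, levels, caption_segment, try_answer):
    """Deepen memory granularity until a confident answer emerges.

    video:           a frame sequence (anything with len()).
    levels:          segment lengths in frames, coarse to fine, e.g. [256, 64, 16].
    caption_segment: (video, start, length, depth) -> text memory for that span.
    try_answer:      (question, memories) -> (answer, confident, relevant_spans),
                     where relevant_spans is a list of (start, length) pairs
                     worth re-examining at finer granularity.
    """
    n_frames = len(video)
    # Depth 0: coarse memories covering the whole video.
    spans = [(s, levels[0]) for s in range(0, n_frames, levels[0])]
    memories = []
    answer = None
    for depth, seg_len in enumerate(levels):
        memories += [(s, l, caption_segment(video, s, l, depth))
                     for (s, l) in spans]
        answer, confident, relevant = try_answer(question, memories)
        if confident or depth == len(levels) - 1:
            return answer
        # Backtrack: split only question-relevant spans into finer segments.
        next_len = levels[depth + 1]
        spans = [(s + offset, next_len)
                 for (s, l) in relevant
                 for offset in range(0, l, next_len)]
    return answer
```

The key design point mirrored here is that dense, fine-grained captioning is paid for only on spans the agent judges relevant, rather than over the entire video.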

Jialong Zuo, Yongtai Deng, Lingdong Kong, Jingkang Yang, Rui Jin, Yiwei Zhang, Nong Sang, Liang Pan, Ziwei Liu, Changxin Gao • 2025

Related benchmarks

Task                     | Dataset           | Metric                           | Result | Rank
Video Question Answering | MLVU              | Accuracy                         | 76.1   | 143
Long Video Understanding | LVBench           | Accuracy                         | 58.8   | 133
Video Question Answering | LVBench           | Accuracy                         | 58.8   | 108
Long Video Understanding | MLVU (test)       | --                               | --     | 60
Video Question Answering | Video-MME         | Accuracy (Average, w/o Subtitle) | 72.5   | 48
Long Video Understanding | Video-MME Long    | Accuracy                         | 66.8   | 46
Long Video Understanding | Video-MME Overall | Accuracy                         | 72.5   | 39
