LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos
About
In this paper, we introduce LifelongMemory, a new framework for accessing long-form egocentric videographic memory through natural language question answering and retrieval. LifelongMemory generates concise video activity descriptions of the camera wearer and leverages the zero-shot capabilities of pretrained large language models to perform reasoning over long-form video context. Furthermore, LifelongMemory uses a confidence and explanation module to produce confident, high-quality, and interpretable answers. Our approach achieves state-of-the-art performance on the EgoSchema benchmark for question answering and is highly competitive on the natural language query (NLQ) challenge of Ego4D. Code is available at https://github.com/agentic-learning-ai-lab/lifelong-memory.
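The three stages named in the abstract (concise activity descriptions, zero-shot LLM reasoning, and a confidence and explanation step) can be illustrated with a minimal sketch. This is not the authors' implementation, which lives in the linked repository. Every name here (`Caption`, `condense`, `build_prompt`, `answer`, the JSON reply format, and the `min_confidence` threshold) is a hypothetical choice made for illustration, and the `llm` argument stands in for any function that maps a prompt string to a reply string.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Caption:
    start: float  # seconds from the start of the video
    end: float
    text: str     # short description of the camera wearer's activity


def condense(captions: List[Caption]) -> List[Caption]:
    """Merge consecutive captions with identical text into one longer interval."""
    merged: List[Caption] = []
    for cap in captions:
        if merged and merged[-1].text == cap.text:
            merged[-1] = Caption(merged[-1].start, cap.end, cap.text)
        else:
            merged.append(cap)
    return merged


def build_prompt(captions: List[Caption], question: str, options: List[str]) -> str:
    """Format captions, question, and options into one zero-shot prompt (assumed layout)."""
    lines = [f"[{c.start:.0f}s-{c.end:.0f}s] {c.text}" for c in captions]
    opts = [f"{i}. {o}" for i, o in enumerate(options)]
    return (
        "Activity log of the camera wearer:\n" + "\n".join(lines)
        + f"\n\nQuestion: {question}\nOptions:\n" + "\n".join(opts)
        + '\n\nReply as JSON: {"answer": <index>, "confidence": <1-3>, "explanation": <text>}'
    )


def answer(
    captions: List[Caption],
    question: str,
    options: List[str],
    llm: Callable[[str], str],
    min_confidence: int = 2,  # hypothetical threshold, not taken from the paper
) -> Dict[str, object]:
    """Query the LLM and flag whether its self-reported confidence meets the threshold."""
    reply = json.loads(llm(build_prompt(condense(captions), question, options)))
    return {
        "answer": int(reply["answer"]),
        "explanation": str(reply["explanation"]),
        "confident": int(reply["confidence"]) >= min_confidence,
    }
```

In use, `llm` would wrap a call to whichever model is available. Answers flagged as not confident could be retried or discarded, which is one plausible reading of how a confidence step filters low-quality predictions.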
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Question Answering | EgoSchema (Full) | Accuracy | 64.7 | 193 |
| Video Question Answering | EgoSchema subset | Accuracy | 72 | 73 |
| Video Question Answering | EgoSchema 500-question subset | Accuracy | 68 | 50 |
| Video Question Answering | EgoSchema 5031 videos (test) | Top-1 Accuracy | 62.4 | 26 |
| Egocentric Video Question Answering | EgoSchema (public leaderboard) | Accuracy | 68 | 13 |