Agentic Keyframe Search for Video Question Answering

About

Video question answering (VideoQA) enables machines to extract and comprehend key information from videos through natural language interaction, which is a critical step towards achieving intelligence. However, the need for thorough video understanding and the high computational cost still limit the widespread application of VideoQA. To address this, we propose Agentic Keyframe Search (AKeyS), a simple yet powerful algorithm for identifying keyframes in the VideoQA task. It effectively distinguishes key information from redundant, irrelevant content by using modern language agents to direct classical search algorithms. Specifically, we first segment the video and organize the segments as a tree. Then, AKeyS uses a language agent to estimate heuristics and movement costs while dynamically expanding nodes. Finally, the agent decides, based on termination conditions, whether sufficient keyframes have been collected, and provides an answer. Extensive experiments on the EgoSchema and NExT-QA datasets show that AKeyS outperforms all previous methods with the highest keyframe searching efficiency, meaning it can accurately identify key information and conduct effective visual reasoning with minimal computational overhead. For example, on the EgoSchema subset, it achieves 1.8% higher accuracy than VideoTree while processing only 43.5% as many frames. We believe that AKeyS represents a significant step towards building intelligent agents for video understanding. The code is publicly available at https://github.com/fansunqi/AKeyS.
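
The three steps described above (tree of segments, agent-guided node expansion, termination check) can be illustrated with a minimal sketch. This is not the authors' implementation; see the linked repository for that. It assumes a best-first search in the style of A*, a binary split of each segment, and the middle frame as each segment's keyframe. The names `Segment`, `keyframe_search`, `heuristic`, `cost`, and `is_sufficient` are hypothetical, and the three callables are simple stand-ins for what a language agent would estimate.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """A node in the video tree: a contiguous range of frame indices."""
    start: int  # first frame index (inclusive)
    end: int    # last frame index (inclusive)

    @property
    def keyframe(self) -> int:
        # Assumption: the middle frame represents the segment.
        return (self.start + self.end) // 2

    def children(self, branching: int = 2) -> List["Segment"]:
        # Split the segment into roughly equal sub-segments.
        length = self.end - self.start + 1
        if length <= 1:
            return []
        step = -(-length // branching)  # ceiling division
        return [
            Segment(s, min(s + step - 1, self.end))
            for s in range(self.start, self.end + 1, step)
        ]


def keyframe_search(
    num_frames: int,
    heuristic: Callable[[Segment], float],       # stand-in for the agent's heuristic
    cost: Callable[[Segment, Segment], float],   # stand-in for the agent's movement cost
    is_sufficient: Callable[[List[int]], bool],  # stand-in for the termination check
    max_expansions: int = 50,
) -> List[int]:
    """Best-first (A*-style) search over the segment tree; returns keyframes."""
    root = Segment(0, num_frames - 1)
    tie = 0  # unique tie-breaker so Segment objects are never compared
    frontier = [(heuristic(root), tie, 0.0, root)]
    keyframes: List[int] = []
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, g, node = heapq.heappop(frontier)
        keyframes.append(node.keyframe)  # visiting a node collects its keyframe
        if is_sufficient(keyframes):
            break
        expansions += 1
        for child in node.children():
            g_child = g + cost(node, child)
            tie += 1
            heapq.heappush(frontier, (g_child + heuristic(child), tie, g_child, child))
    return sorted(set(keyframes))


# Toy stand-ins: pretend the answer is visible near frame 70 of a 100-frame video.
TARGET = 70

def toy_heuristic(seg: Segment) -> float:
    # Zero if the target lies inside the segment, else distance to the nearest edge.
    if seg.start <= TARGET <= seg.end:
        return 0.0
    return float(min(abs(seg.start - TARGET), abs(seg.end - TARGET)))

def toy_cost(parent: Segment, child: Segment) -> float:
    return 1.0  # uniform cost per tree level

def toy_is_sufficient(frames: List[int]) -> bool:
    return any(abs(f - TARGET) <= 2 for f in frames)

frames = keyframe_search(100, toy_heuristic, toy_cost, toy_is_sufficient)
print(frames)
```

In this toy run the search descends towards the segment containing the target and stops after inspecting only a handful of frames instead of all 100. In a real system the three callables would be replaced by calls to a language agent that scores segments against the question.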

Sunqi Fan, Meng-Hao Guo, Shuojin Yang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Question Answering | NExT-QA (test) | Accuracy 78.1 | 204 |
| Video Question Answering | EgoSchema (Full) | Accuracy 63.6 | 193 |
| Video Question Answering | EgoSchema subset | Accuracy 68.6 | 73 |
