SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

About

Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While multimodal large language models (MLLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels--global, page, and element--to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent significantly improves accuracy over both proprietary (+7.9%) and open-source models (+9.8%).
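The three-level decomposition described above can be pictured with a small data-structure sketch. This is an illustration only, not the authors' implementation: the class names (`ElementNote`, `PageNote`, `DeckKnowledge`), the functions `select_levels` and `gather_context`, and the keyword-based routing are all hypothetical stand-ins. In the paper the levels are handled by specialized agents built on MLLMs, whereas the toy router here simply matches keywords in the query.

```python
from dataclasses import dataclass, field


@dataclass
class ElementNote:
    """Hypothetical element-level record (e.g. a chart, icon, or text box)."""
    kind: str
    description: str


@dataclass
class PageNote:
    """Hypothetical page-level record with its element-level children."""
    number: int
    summary: str
    elements: list[ElementNote] = field(default_factory=list)


@dataclass
class DeckKnowledge:
    """Hypothetical query-agnostic representation of a whole deck."""
    global_summary: str
    pages: list[PageNote] = field(default_factory=list)


def select_levels(query: str) -> list[str]:
    """Toy keyword router (an assumption, standing in for agent selection)."""
    q = query.lower()
    levels = []
    if any(w in q for w in ("overall", "theme", "purpose", "main")):
        levels.append("global")
    if any(w in q for w in ("page", "slide")):
        levels.append("page")
    if any(w in q for w in ("chart", "icon", "color", "figure", "table")):
        levels.append("element")
    # If nothing matches, fall back to consulting every level.
    return levels or ["global", "page", "element"]


def gather_context(knowledge: DeckKnowledge, query: str) -> list[str]:
    """Collect notes from only the levels selected for this query."""
    levels = select_levels(query)
    context = []
    if "global" in levels:
        context.append(f"[global] {knowledge.global_summary}")
    for page in knowledge.pages:
        if "page" in levels:
            context.append(f"[page {page.number}] {page.summary}")
        if "element" in levels:
            for el in page.elements:
                context.append(f"[page {page.number} / {el.kind}] {el.description}")
    return context
```

In this sketch the representation is built once per deck and reused across queries, which is one way to read "query-agnostic"; the collected context would then be passed to a model to compose the final answer.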

Yiqiao Jin, Rachneet Kaur, Zhen Zeng, Sumitra Ganesh, Srijan Kumar • 2025

Related benchmarks

Task	Dataset	Result
Document Visual Question Answering	SlideVQA	Accuracy 0.849	53
Slide Question Answering	SlideVQA	Overall Score 72.7	29
End-to-end Question Answering	TechSlides	Overall Score 70.9	25
End-to-end Question Answering	FinSlides	Overall Score 85.5	25
End-to-End Document Question Answering	InfoVQA (test)	Overall Score 79.6	8
Document Visual Question Answering	DocVQA	Overall Score 94.7	4
Visual Question Answering	SlideVQA (test)	Overall Accuracy 90.5	4

