
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

About

While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models have been adopted to actively interact with the external environment for evidence retrieval. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch. Specifically, we make the following contributions: (i) we introduce Agentic Seeding, a dedicated phase designed to weave the foundational precursors necessary for eliciting agentic behaviors; (ii) we uncover a performance bottleneck in long-horizon interactions, where the increasing volume of interaction history overwhelms the model's ability to locate ground-truth evidence. To mitigate this, we propose V-Fold, an adaptive history-aware compression scheme that preserves recent dialogue turns in high fidelity while folding historical context into the visual space via rendering; and (iii) we develop POINTS-Seeker-8B, a state-of-the-art multimodal agentic search model that consistently outperforms existing models across six diverse benchmarks, effectively resolving the challenges of long-horizon, knowledge-intensive visual reasoning.
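The V-Fold idea described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the function names (`v_fold`, `render_to_image`), the turn format, and the `keep_turns` parameter are all assumptions, and the actual method renders folded history into real images for the model's vision encoder rather than the tagged placeholder used here.

```python
# Hedged sketch of a V-Fold-style history compression scheme (all names
# assumed): keep the most recent `keep_turns` dialogue turns verbatim,
# and "fold" all older turns into a single image-like placeholder that
# stands in for the rendered visual context described in the paper.

def render_to_image(turns):
    """Stand-in for rendering text history into the visual space.
    The real method would produce an actual image; here we just tag
    the concatenated text so the folding step is visible."""
    text = "\n".join(f"{t['role']}: {t['content']}" for t in turns)
    return {"type": "image", "rendered_text": text}

def v_fold(history, keep_turns=2):
    """Adaptive history-aware compression: recent turns stay in high
    fidelity, older turns collapse into one rendered placeholder."""
    if len(history) <= keep_turns:
        return list(history)
    old, recent = history[:-keep_turns], history[-keep_turns:]
    return [render_to_image(old)] + recent

# Example: a four-turn search interaction folds down to three entries.
history = [
    {"role": "user", "content": "Who is shown in this photo?"},
    {"role": "assistant", "content": "Issuing an image search..."},
    {"role": "user", "content": "Check the top result's page."},
    {"role": "assistant", "content": "The page names the person."},
]
compressed = v_fold(history, keep_turns=2)
print(len(compressed))  # -> 3: one folded image + two recent turns
```

The design intent is that the length of the textual context stays bounded by `keep_turns` no matter how long the interaction runs, shifting the burden of older evidence onto the visual channel instead.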

Yikun Liu, Yuan Liu, Le Tian, Xiao Zhou, Jiangchao Yao, Yanfeng Wang, Weidi Xie • 2026

Related benchmarks

Task                                  | Dataset              | Metric   | Result | Rank
--------------------------------------|----------------------|----------|--------|-----
Visual Question Answering             | LiveVQA              | Accuracy | 77.7   | 108
Visual Question Answering             | SimpleVQA            | Accuracy | 0.688  | 99
Fact-based Visual Question Answering  | FVQA                 | Accuracy | 71.2   | 46
Agentic Search                        | MMSearch v1.0 (test) | Accuracy | 70.8   | 21
Agentic Search                        | BC-VL                | Accuracy | 44.4   | 18
Agentic Search                        | MMSearch+            | Accuracy | 25.2   | 10
