What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity

About

To navigate partially observable visual environments, recent VLM agents increasingly internalize world-modeling capabilities into their policies via explicit CoT reasoning, enabling them to mentally simulate futures before acting. However, relying solely on passive reasoning over visited states is insufficient for sparse-reward tasks, as it lacks the epistemic drive to actively uncover the "known unknown" required for robust generalization. We ask: can VLM agents actively find signals that challenge and refine their internal world model through curiosity-driven exploration? In this work, we propose GLANCE, a unified framework that bridges reasoning and exploration by grounding the agent's linguistic world model in the stable visual representations of an evolving target network. Crucially, GLANCE uses the discrepancy between linguistic prediction and visual reality as an intrinsic curiosity signal within reinforcement learning, steering the agent to actively explore areas where its internal model is uncertain. Extensive experiments across a series of agentic tasks demonstrate the effectiveness of GLANCE and show that aligning "what the agent thinks" with "what the agent sees" is key to solving complex or sparse-reward agentic tasks.
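The curiosity signal described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the discrepancy measure, how the target network evolves, or how the bonus enters the reward. The sketch assumes a cosine-distance discrepancy between an embedding of the agent's linguistic prediction and the target network's embedding of the actual next observation, an exponential-moving-average (EMA) target update, and an additive bonus weighted by a hypothetical coefficient `beta`. All function and parameter names are illustrative.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def curiosity_bonus(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Assumed discrepancy measure: cosine distance between the embedding of the
    predicted next state and the target network's embedding of the real one."""
    return 1.0 - cosine_similarity(predicted, observed)


def shaped_reward(extrinsic: float,
                  predicted: Sequence[float],
                  observed: Sequence[float],
                  beta: float = 0.1) -> float:
    """Assumed combination rule: task reward plus a weighted curiosity bonus."""
    return extrinsic + beta * curiosity_bonus(predicted, observed)


def ema_update(target: Sequence[float],
               online: Sequence[float],
               tau: float = 0.99) -> List[float]:
    """Assumed form of the 'evolving target network': an EMA of online weights."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target, online)]


if __name__ == "__main__":
    # A prediction that matches the observation earns no bonus;
    # an orthogonal one earns the full bonus of 1.0 (scaled by beta).
    print(shaped_reward(0.0, [1.0, 0.0], [1.0, 0.0], beta=0.5))
    print(shaped_reward(0.0, [1.0, 0.0], [0.0, 1.0], beta=0.5))
```

Under these assumptions, the bonus is largest where the linguistic prediction disagrees most with what the target network sees, which is the behavior the abstract attributes to the curiosity signal; a slowly moving target keeps that reference representation stable during training.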

Haoxi Li, Qinglin Hou, Jianfei Ma, Jinxiang Lai, Tao Han, Sikai Bai, Jingcai Guo, Jie Zhang, Song Guo • 2026

Related benchmarks

Task	Dataset	Result
Visual Agentic Reasoning	Sokoban	Success Rate: 85	27
Embodied Navigation	Navigation	Base Score: 86	17
Puzzle Reasoning	FrozenLake	Success Rate: 78	17
Robotic Manipulation	PrimitiveSkill	Place Success Rate: 100	17
SVG reconstruction	sVG	Dino Score: 0.92	17

