What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity

About

To navigate partially observable visual environments, recent VLM agents increasingly internalize world modeling capabilities into their policies via explicit CoT reasoning, enabling them to mentally simulate futures before acting. However, relying solely on passive reasoning over visited states is insufficient for sparse-reward tasks, as it lacks the epistemic drive to actively uncover the "known unknown" required for robust generalization. We ask: Can VLM agents actively find signals that challenge and refine their internal world model through curiosity-driven exploration? In this work, we propose GLANCE, a unified framework that bridges reasoning and exploration by grounding the agent's linguistic world model into the stable visual representations of an evolving target network. Crucially, GLANCE leverages the discrepancy between linguistic prediction and visual reality as an intrinsic curiosity signal within reinforcement learning, steering the agent to actively explore areas where its internal model is uncertain. Extensive experiments across a series of agentic tasks show the effectiveness of GLANCE, and demonstrate that aligning "what the agent thinks" with "what the agent sees" is key to solving complex or sparse agentic tasks.
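
As a rough illustration of the idea in the abstract, a prediction-versus-observation discrepancy can be turned into an intrinsic reward and added to the task reward. The sketch below is not the GLANCE implementation: the function names, the cosine-distance measure, the weight `beta`, and the exponential-moving-average rule for the target network (with rate `tau`) are all assumptions made for illustration, and embeddings are plain Python lists rather than real model outputs.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; larger means the vectors disagree more."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Degenerate embedding: treat as maximally uninformative.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def shaped_reward(extrinsic, predicted_emb, observed_emb, beta=0.1):
    """Add a curiosity bonus proportional to the prediction/observation gap.

    predicted_emb: hypothetical embedding of the agent's linguistic prediction.
    observed_emb:  hypothetical embedding of the next visual observation,
                   produced by a slowly changing target encoder.
    beta:          assumed weight of the intrinsic term.
    """
    return extrinsic + beta * cosine_distance(predicted_emb, observed_emb)


def ema_update(target_params, online_params, tau=0.99):
    """Assumed target-network rule: move target parameters slowly toward online ones."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]
```

Under these assumptions, a step where the prediction matches the observation earns no bonus, while a step where they are orthogonal earns the full `beta`, which is what would push the agent toward states its internal model predicts poorly.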

Haoxi Li, Qinglin Hou, Jianfei Ma, Jinxiang Lai, Tao Han, Sikai Bai, Jingcai Guo, Jie Zhang, Song Guo • 2026

Related benchmarks

Task                      Dataset         Metric              Result  Rank
Visual Agentic Reasoning  Sokoban         Success Rate        85      27
Embodied Navigation       Navigation      Base Score          86      17
Puzzle Reasoning          FrozenLake      Success Rate        78      17
Robotic Manipulation      PrimitiveSkill  Place Success Rate  100     17
SVG Reconstruction        SVG             Dino Score          0.92    17
