Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues

About

Language-driven instance object navigation assumes that human users initiate the task by providing a detailed description of the target instance to the embodied agent. While this description is crucial for distinguishing the target from visually similar instances in a scene, providing it prior to navigation can be demanding for humans. To bridge this gap, we introduce Collaborative Instance object Navigation (CoIN), a new task setting where the agent actively resolves uncertainties about the target instance during navigation through natural, template-free, open-ended dialogues with humans. We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which operates independently of the navigation policy and focuses on human-agent interaction reasoning with Vision-Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self-Questioner model initiates a self-dialogue within the agent to obtain a complete and accurate description of the observation, using a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask the human a question, continue navigating, or halt navigation, minimizing user input. For evaluation, we introduce CoIN-Bench, with a curated dataset designed for challenging multi-instance scenarios. CoIN-Bench supports both online evaluation with humans and reproducible experiments with simulated user-agent interactions. On CoIN-Bench, we show that AIUTA serves as a competitive baseline, while existing language-driven instance navigation methods struggle in complex multi-instance scenes. Code and benchmark will be available upon acceptance at https://intelligolabs.github.io/CoIN/
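The three-way decision attributed to the Interaction Trigger (ask, continue, or halt) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the decision is computed, so the alignment score in [0, 1], the two thresholds, the per-attribute certainty values, and the names `interaction_trigger` and `filter_uncertain` are all assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    STOP = "stop"          # halt navigation: the detection is taken to be the target
    ASK = "ask"            # pose a question to the human
    CONTINUE = "continue"  # discard the detection and keep exploring


def filter_uncertain(attributes, min_certainty=0.5):
    """Keep attributes whose certainty score passes a cutoff.

    `attributes` maps a name to a (value, certainty) pair. Both the
    certainty score and the cutoff are hypothetical stand-ins for the
    uncertainty estimation mentioned in the abstract.
    """
    return {
        name: value
        for name, (value, certainty) in attributes.items()
        if certainty >= min_certainty
    }


def interaction_trigger(alignment_score, stop_threshold=0.8, skip_threshold=0.3):
    """Map an assumed alignment score in [0, 1] to one of three actions.

    High alignment: stop. Low alignment: continue without asking.
    Anything in between: ask the human. The thresholds are illustrative.
    """
    if not 0.0 <= alignment_score <= 1.0:
        raise ValueError("alignment_score must lie in [0, 1]")
    if alignment_score >= stop_threshold:
        return Action.STOP
    if alignment_score < skip_threshold:
        return Action.CONTINUE
    return Action.ASK


if __name__ == "__main__":
    observed = {"color": ("red", 0.9), "material": ("wood", 0.2)}
    print(filter_uncertain(observed))
    print(interaction_trigger(0.5))
```

In this sketch, a question is asked only in the ambiguous middle band, which is one simple way to keep user input low: clear matches and clear mismatches are resolved without involving the human.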

Francesco Taioli, Edoardo Zorzi, Gianni Franchi, Alberto Castellini, Alessandro Farinelli, Marco Cristani, Yiming Wang • 2024

Related benchmarks

Task	Dataset	Metric	Result
Interactive Graph Navigation	IGN Qwen3.5-8B backbone (evaluation)	SR@1.5 (Easy)	23.6	12
Contextual Instance Navigation	CoIN-Bench Seen Synonyms (val)	Success Rate (SR)	14.4	6
Contextual Instance Navigation	CoIN-Bench Unseen (val)	SR	6.7	6
Navigation	CoIN Seen QAsk-Nav annotations (val)	Success Rate (SR)	10.46	6
Navigation	CoIN Seen Synonyms (QAsk-Nav annotations) (val)	SR	14.89	6
Navigation	CoIN Unseen (QAsk-Nav annotations) (val)	SR	7.83	6
Contextual Instance Navigation	CoIN-Bench Seen (val)	SR	7.4	6
Question-Answering Navigation	QAsk-Nav	Success Rate (SR)	30.27	5
