Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues
About
Language-driven instance object navigation assumes that human users initiate the task by providing a detailed description of the target instance to the embodied agent. While this description is crucial for distinguishing the target from visually similar instances in a scene, providing it prior to navigation can be demanding for humans. To bridge this gap, we introduce Collaborative Instance Object Navigation (CoIN), a new task setting where the agent actively resolves uncertainties about the target instance during navigation through natural, template-free, open-ended dialogues with humans. We propose a novel training-free method, Agent-user Interaction with UncerTainty Awareness (AIUTA), which operates independently of the navigation policy and focuses on human-agent interaction reasoning with Vision-Language Models (VLMs) and Large Language Models (LLMs). First, upon object detection, a Self-Questioner model initiates a self-dialogue within the agent to obtain a complete and accurate description of the observation, using a novel uncertainty estimation technique. Then, an Interaction Trigger module determines whether to ask the human a question, continue navigating, or halt navigation, thereby minimizing user input. For evaluation, we introduce CoIN-Bench, a curated dataset designed for challenging multi-instance scenarios. CoIN-Bench supports both online evaluation with humans and reproducible experiments with simulated user-agent interactions. On CoIN-Bench, we show that AIUTA serves as a competitive baseline, while existing language-driven instance navigation methods struggle in complex multi-instance scenes. Code and benchmark will be available upon acceptance at https://intelligolabs.github.io/CoIN/
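The two-stage flow described in the abstract (filter out uncertain observations, then decide whether to ask, continue, or stop) can be illustrated with a toy sketch. This is not the AIUTA implementation: the abstract does not specify the decision rule, and the actual method reasons with VLMs and LLMs. The sketch swaps in a simple attribute-matching rule, and every name in it (`filter_uncertain`, `interaction_trigger`, `min_confidence`, `min_matches`, the attribute dictionaries) is a hypothetical choice made for illustration.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    CONTINUE = "continue"  # keep navigating; this detection is not accepted as the target
    ASK = "ask"            # ask the human a clarifying question
    STOP = "stop"          # halt navigation; the target is considered found


def filter_uncertain(
    attributes: dict[str, tuple[str, float]], min_confidence: float = 0.7
) -> dict[str, str]:
    """Keep only attribute values whose confidence clears the threshold.

    Assumption: each attribute maps to a (value, confidence) pair. How the
    confidence is estimated is outside the scope of this sketch.
    """
    return {
        name: value
        for name, (value, conf) in attributes.items()
        if conf >= min_confidence
    }


def interaction_trigger(
    observed: dict[str, str], known_facts: dict[str, str], min_matches: int = 2
) -> tuple[Action, Optional[str]]:
    """Toy decision rule (an assumption, not the paper's rule).

    - Any contradiction with what the human already said: continue navigating.
    - Enough agreeing attributes: stop.
    - Otherwise ask about one observed attribute the human has not described.
    """
    for name, value in observed.items():
        if name in known_facts and known_facts[name] != value:
            return Action.CONTINUE, None

    # After the loop above, every shared attribute is known to agree.
    matches = sum(1 for name in observed if name in known_facts)
    if matches >= min_matches:
        return Action.STOP, None

    unknown = sorted(name for name in observed if name not in known_facts)
    if unknown:
        return Action.ASK, f"What is the {unknown[0]} of the target?"
    return Action.CONTINUE, None


# Hypothetical detection: the low-confidence "material" attribute is dropped.
detected = {
    "color": ("red", 0.9),
    "material": ("wood", 0.4),
    "location": ("near window", 0.8),
}
observed = filter_uncertain(detected)
action, question = interaction_trigger(observed, known_facts={})
```

With no facts known about the target, the toy rule asks about the first unexplained attribute. Once the human's answers agree with the observation on enough attributes it stops, and a single contradiction makes it move on without further questions, which is one simple way to keep user input low.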
Related benchmarks
| Task | Dataset | Success Rate (SR) | Rank |
|---|---|---|---|
| Contextual Instance Navigation | CoIN-Bench Seen Synonyms (val) | 14.4 | 6 |
| Contextual Instance Navigation | CoIN-Bench Unseen (val) | 6.7 | 6 |
| Navigation | CoIN Seen QAsk-Nav annotations (val) | 10.46 | 6 |
| Navigation | CoIN Seen Synonyms (QAsk-Nav annotations) (val) | 14.89 | 6 |
| Navigation | CoIN Unseen (QAsk-Nav annotations) (val) | 7.83 | 6 |
| Contextual Instance Navigation | CoIN-Bench Seen (val) | 7.4 | 6 |
| Question-Answering Navigation | QAsk-Nav | 30.27 | 5 |