
TANGO: Training-free Embodied AI Agents for Open-world Tasks

About

Large Language Models (LLMs) have demonstrated excellent capabilities in composing various modules into programs that perform complex reasoning tasks on images. In this paper, we propose TANGO, an approach that extends this LLM-driven program composition, already observed for images, to embodied agents capable of observing and acting in the world. Specifically, by employing a simple PointGoal Navigation model combined with a memory-based exploration policy as a foundational primitive for guiding an agent through the world, we show how a single model can address diverse tasks without additional training. We task an LLM with composing the provided primitives to solve a specific task, using only a few in-context examples in the prompt. We evaluate our approach on three key Embodied AI tasks: Open-Set ObjectGoal Navigation, Multi-Modal Lifelong Navigation, and Open Embodied Question Answering, achieving state-of-the-art results without any task-specific fine-tuning in challenging zero-shot scenarios.
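The workflow described in the abstract — an LLM composing a fixed set of navigation primitives into a task-specific program — can be illustrated with a minimal sketch. All primitive names and data structures below (`detect`, `point_goal_nav`, `explore`, the `agent` dict) are hypothetical stand-ins, not TANGO's actual API; the composed `objectgoal_program` plays the role of code the LLM would generate from in-context examples.

```python
# Hedged sketch of TANGO-style primitive composition.
# All names here are illustrative assumptions, not the paper's real interface.

def detect(observation, label):
    """Open-vocabulary detector stub: returns (found, position) for a label."""
    objects = observation["objects"]
    return (label in objects, objects.get(label))

def point_goal_nav(agent, target):
    """PointGoal Navigation primitive stub: drive the agent to coordinates."""
    agent["pos"] = target
    return agent

def explore(agent, memory):
    """Memory-based exploration stub: visit the next unvisited frontier waypoint."""
    for waypoint in agent["frontier"]:
        if waypoint not in memory:
            memory.add(waypoint)
            return point_goal_nav(agent, waypoint)
    return agent  # frontier exhausted

def objectgoal_program(agent, goal_label, max_steps=10):
    """A program an LLM might compose for ObjectGoal Navigation:
    explore until the goal object is detected, then navigate to it."""
    memory = set()
    for _ in range(max_steps):
        found, pos = detect(agent["observation"], goal_label)
        if found:
            return point_goal_nav(agent, pos)
        explore(agent, memory)
    return agent  # goal not found within the step budget
```

Because the same primitives are reused across tasks, only the composed program changes between, say, ObjectGoal Navigation and Embodied Question Answering — which is what lets a single model handle diverse tasks without extra training.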

Filippo Ziliotto, Tommaso Campari, Luciano Serafini, Lamberto Ballan • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Goal Navigation | HM3D-OVON unseen (val) | Success Rate | 35.5 | 47 |
| Open-set ObjectGoal Navigation | HM3D-OVON unseen (val) | SR | 35.5 | 42 |
| Embodied Question Answering | A-EQA | - | - | 25 |
| Multi-Modal Lifelong Navigation | GOAT-Bench unseen (val) | SR | 32.1 | 22 |
| Open-Vocabulary Object Goal Navigation | HM3D OVON (test) | SR | 35.5 | 17 |
| Object Goal Navigation | HM3D OVON | SR | 35.5 | 14 |
| Goal-conditioned navigation | GOAT-Bench | SR | 32.1 | 12 |
| Object Goal Navigation | HM3D OVON v1 (val unseen) | SR | 35.5 | 12 |
| Embodied Navigation | GOAT-Bench unseen (val) | Success Rate (SR) | 32.1 | 10 |
| Embodied Question Answering | OpenEQA v1 (test) | Score | 37.2 | 5 |

Showing 10 of 11 rows
