Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control

About

Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. Such capabilities are difficult to learn solely from task-specific data. This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations such as in CLIP have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding -- a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using pre-trained text-to-image diffusion models, we construct Stable Control Representations which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned using Stable Control Representations are competitive with state-of-the-art representation learning approaches across a broad range of simulated control settings, encompassing challenging manipulation and navigation tasks. Most notably, we show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
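
The abstract describes the recipe only at a high level. As a rough illustration of the general idea, the sketch below extracts an intermediate U-Net activation from an off-the-shelf Stable Diffusion checkpoint, conditioned on a task prompt, for use as a frozen observation embedding for a downstream policy. This is a minimal sketch, not the authors' released implementation: the checkpoint id, the mid-block layer choice, the noising timestep, and the helper name are illustrative assumptions.

```python
# Hedged sketch: text-conditioned diffusion features as a frozen state
# representation. Layer, timestep, and checkpoint are assumptions, not the
# paper's exact configuration.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
)
pipe.unet.requires_grad_(False)  # representation is frozen; only the policy trains

features = {}
def _hook(_module, _inputs, output):
    features["mid"] = output  # cache the mid-block activation

pipe.unet.mid_block.register_forward_hook(_hook)

@torch.no_grad()
def stable_control_representation(image, prompt, t=100):
    """image: (B, 3, 512, 512) float tensor in [-1, 1]; returns (B, D) features."""
    # Encode the observation into the VAE latent space.
    latents = pipe.vae.encode(image).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor
    # Lightly noise the latents so they match the U-Net's training distribution.
    timestep = torch.tensor([t], device=latents.device)
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), timestep)
    # Embed the task prompt with the pipeline's own text encoder.
    tokens = pipe.tokenizer(
        prompt, padding="max_length", truncation=True,
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    )
    text_emb = pipe.text_encoder(tokens.input_ids.to(latents.device))[0]
    # One denoising forward pass; the hook captures the mid-block features.
    pipe.unet(noisy, timestep, encoder_hidden_states=text_emb)
    return features["mid"].flatten(start_dim=1)  # flat vector for a policy head
```

The resulting vector would be fed to a small policy network trained with imitation or reinforcement learning; which U-Net block and diffusion timestep to tap are design choices the paper studies, and the values above are placeholders.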

Gunshi Gupta, Karmesh Yadav, Yarin Gal, Dhruv Batra, Zsolt Kira, Cong Lu, Tim G. J. Rudner • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robotic Manipulation | Meta-World | Average Success Rate | 0.949 | 27 |
| Robotic Manipulation | Franka-Kitchen | Average Success Rate | 49.9 | 24 |
| Open Vocabulary Mobile Manipulation | OVMM | Success Rate | 43.6 | 11 |
| Image-Goal Navigation | Gibson (14 held-out scenes) | Success Rate | 73.9 | 7 |
| Referring Expression Grounding | OCID-Ref | Accuracy (Overall) | 92.9 | 7 |
| Grasp Affordance Prediction | Grasp Affordance Prediction (test) | Top99 Accuracy | 72.9 | 6 |

Other info

Code
