Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control

About

Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. Such capabilities are difficult to learn solely from task-specific data. This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations such as in CLIP have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding -- a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using pre-trained text-to-image diffusion models, we construct Stable Control Representations which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned using Stable Control Representations are competitive with state-of-the-art representation learning approaches across a broad range of simulated control settings, encompassing challenging manipulation and navigation tasks. Most notably, we show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
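
The abstract describes the recipe only at a high level. As a rough illustration of the general idea, the sketch below extracts an intermediate U-Net activation from an off-the-shelf Stable Diffusion checkpoint, conditioned on a task prompt, for use as a frozen observation embedding for a downstream policy. This is a minimal sketch, not the authors' released implementation: the checkpoint id, the mid-block layer choice, the noising timestep, and the helper name are illustrative assumptions.

```python
# Hedged sketch: text-conditioned diffusion features as a frozen state
# representation. Layer, timestep, and checkpoint are assumptions, not the
# paper's exact configuration.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
)
pipe.unet.requires_grad_(False)  # representation is frozen; only the policy trains

features = {}
def _hook(_module, _inputs, output):
    features["mid"] = output  # cache the mid-block activation

pipe.unet.mid_block.register_forward_hook(_hook)

@torch.no_grad()
def stable_control_representation(image, prompt, t=100):
    """image: (B, 3, 512, 512) float tensor in [-1, 1]; returns (B, D) features."""
    # Encode the observation into the VAE latent space.
    latents = pipe.vae.encode(image).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor
    # Lightly noise the latents so they match the U-Net's training distribution.
    timestep = torch.tensor([t], device=latents.device)
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), timestep)
    # Embed the task prompt with the pipeline's own text encoder.
    tokens = pipe.tokenizer(
        prompt, padding="max_length", truncation=True,
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    )
    text_emb = pipe.text_encoder(tokens.input_ids.to(latents.device))[0]
    # One denoising forward pass; the hook captures the mid-block features.
    pipe.unet(noisy, timestep, encoder_hidden_states=text_emb)
    return features["mid"].flatten(start_dim=1)  # flat vector for a policy head
```

The resulting vector would be fed to a small policy network trained with imitation or reinforcement learning; which U-Net block and diffusion timestep to tap are design choices the paper studies, and the values above are placeholders.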

Gunshi Gupta, Karmesh Yadav, Yarin Gal, Dhruv Batra, Zsolt Kira, Cong Lu, Tim G. J. Rudner • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robotic Manipulation | Meta-World | Average Success Rate | 0.949 | 27 |
| Robotic Manipulation | Franka-Kitchen | Average Success Rate | 49.9 | 24 |
| Open Vocabulary Mobile Manipulation | OVMM | Success Rate | 43.6 | 11 |
| Image-Goal Navigation | Gibson (14 held-out scenes) | Success Rate | 73.9 | 7 |
| Referring Expression Grounding | OCID-Ref | Accuracy (Overall) | 92.9 | 7 |
| Grasp Affordance Prediction | Grasp Affordance Prediction (test) | Top99 Accuracy | 72.9 | 6 |

Other info

Code
