
Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

About

Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits, and sitting in a lotus position. For each of these tasks, we provide only a single-sentence text prompt describing the desired task, with minimal prompt engineering. We provide videos of the trained agents at: https://sites.google.com/view/vlm-rm. We can improve performance by providing a second "baseline" prompt and projecting out the parts of the CLIP embedding space that are irrelevant for distinguishing between the goal and the baseline. Further, we find a strong scaling effect for VLM-RMs: larger VLMs trained with more compute and data are better reward models. The failure modes of VLM-RMs we encountered are all related to known capability limitations of current VLMs, such as limited spatial reasoning ability or visually unrealistic environments that are far off-distribution for the VLM. We find that VLM-RMs are remarkably robust as long as the VLM is large enough. This suggests that future VLMs will become more and more useful reward models for a wide range of RL applications.
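The reward computation the abstract describes can be sketched in a few lines of numpy. This is a minimal sketch, not the paper's implementation: `cosine_reward` and `projected_reward` are hypothetical names, the embedding vectors stand in for real CLIP image/text embeddings (in practice produced by a CLIP encoder from a rendered environment frame and the task prompt), and the exact form of the "baseline prompt" regularization here is one plausible reading of projecting onto the goal-minus-baseline direction.

```python
import numpy as np

def cosine_reward(state_emb, goal_emb):
    """Zero-shot VLM reward (sketch): cosine similarity between the CLIP
    embedding of the current state image and the goal-text embedding."""
    s = state_emb / np.linalg.norm(state_emb)
    g = goal_emb / np.linalg.norm(goal_emb)
    return float(s @ g)

def projected_reward(state_emb, goal_emb, baseline_emb, alpha=1.0):
    """Goal-baseline regularization (sketch, hypothetical formula):
    project the state embedding onto the line through the baseline and
    goal embeddings, blend with the raw embedding via alpha, and score
    by (negative) squared distance to the goal. alpha=0 keeps the raw
    embedding; alpha=1 fully projects out the irrelevant directions."""
    s = state_emb / np.linalg.norm(state_emb)
    g = goal_emb / np.linalg.norm(goal_emb)
    b = baseline_emb / np.linalg.norm(baseline_emb)
    d = (g - b) / np.linalg.norm(g - b)   # direction baseline -> goal
    proj = b + ((s - b) @ d) * d          # projection of s onto that line
    reg = alpha * proj + (1 - alpha) * s  # partial projection
    return float(1.0 - 0.5 * np.sum((reg - g) ** 2))
```

A state whose embedding already matches the goal text gets the maximum reward of 1 regardless of `alpha`, while components of the state embedding orthogonal to the goal-baseline direction are increasingly ignored as `alpha` approaches 1.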

Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, David Lindner • 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Autonomous Driving | CARLA Town 2 (test) | AS | 0.2 | 15
Autonomous Driving | CARLA Town 2 10 unseen (test) | AS | 0.08 | 12
Autonomous Driving | CARLA Town 2 (train) | AS | 10.86 | 12
Value function estimation | BridgeData tk_pnp In-Distribution V2 | VOC | 0.029 | 7
Value function estimation | BridgeData dt_tk_stack Embodiment Shift V2 | VOC | 4.6 | 7
Value function estimation | BridgeData lm_pnp Environment Shift V2 | VOC | 3.3 | 7
Value function estimation | BridgeData Environment Shift V2 (td_fold) | VOC | 7.2 | 7
Value function estimation | BridgeData dt_ft_stack ES & EM V2 | VOC | 2.8 | 7
Value function estimation | BridgeData dt_rd_pnp ES & EM V2 | VOC | 4.1 | 7
Expert vs. Non-Expert Trajectory Discrimination | BridgeData 5 scripted datasets V2 (in-distribution) | BinVOC | 0.00e+0 | 7

(Showing 10 of 19 rows)
