Vision-Language Models as a Source of Rewards

About

Building generalist agents that can accomplish many goals in rich open-ended environments is one of the research frontiers for reinforcement learning. A key limiting factor for building generalist agents with RL has been the need for a large number of reward functions for achieving different goals. We investigate the feasibility of using off-the-shelf vision-language models, or VLMs, as sources of rewards for reinforcement learning agents. We show how rewards for visual achievement of a variety of language goals can be derived from the CLIP family of models, and used to train RL agents that can achieve a variety of language goals. We showcase this approach in two distinct visual domains and present a scaling trend showing how larger VLMs lead to more accurate rewards for visual goal achievement, which in turn produces more capable RL agents.
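The abstract does not spell out how a reward is computed from a CLIP-style model, so the following is only a minimal sketch of one plausible scheme. It assumes the image and each text prompt have already been embedded as vectors, compares the image against the goal prompt and a set of negative prompts by cosine similarity, applies a softmax, and thresholds the goal probability into a binary reward. The function names, the use of negative prompts, and the `temperature` and `threshold` values are illustrative assumptions, not details taken from this page. The toy 2-D vectors stand in for real CLIP embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def goal_probability(
    image_emb: Sequence[float],
    goal_emb: Sequence[float],
    negative_embs: Sequence[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not from the source
) -> float:
    """Softmax probability that the image matches the goal prompt
    rather than any of the negative prompts."""
    sims = [cosine_similarity(image_emb, goal_emb)]
    sims += [cosine_similarity(image_emb, neg) for neg in negative_embs]
    logits = [s / temperature for s in sims]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    return exps[0] / sum(exps)


def binary_reward(
    image_emb: Sequence[float],
    goal_emb: Sequence[float],
    negative_embs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed value, not from the source
    temperature: float = 0.1,
) -> float:
    """Return 1.0 if the goal probability exceeds the threshold, else 0.0."""
    p = goal_probability(image_emb, goal_emb, negative_embs, temperature)
    return 1.0 if p > threshold else 0.0


# Toy stand-ins for embeddings (hypothetical, for illustration only).
goal = [1.0, 0.0]
negatives = [[0.0, 1.0]]
frame_at_goal = [1.0, 0.0]
frame_elsewhere = [0.0, 1.0]

reward_at_goal = binary_reward(frame_at_goal, goal, negatives)
reward_elsewhere = binary_reward(frame_elsewhere, goal, negatives)
```

In an RL loop, a function like `binary_reward` would be called on each observation with the embedding of the current language goal. Thresholding trades a dense but noisy similarity signal for a sparse one that is harder to exploit, which is one reason a sketch like this might prefer it.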

Kate Baumli, Satinder Baveja, Feryal Behbahani, Harris Chan, Gheorghe Comanici, Sebastian Flennerhag, Maxime Gazeau, Kristian Holsheimer, Dan Horgan, Michael Laskin, Clare Lyle, Hussain Masoom, Kay McKinney, Volodymyr Mnih, Alexander Neitz, Dmitry Nikulin, Fabio Pardo, Jack Parker-Holder, John Quan, Tim Rocktäschel, Himanshu Sahni, Tom Schaul, Yannick Schroecker, Stephen Spencer, Richie Steigerwald, Luyu Wang, Lei Zhang • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Autonomous Driving | CARLA Town 2 (test) | AS 0.53 | 15 |
| Autonomous Driving | CARLA Town 2 10 unseen (test) | AS Score 0.06 | 12 |
| Autonomous Driving | CARLA Town 2 (train) | AS 1.49 | 12 |
| Value function estimation | BridgeData dt_tk_stack Embodiment Shift V2 | VOC 3.5 | 7 |
| Value function estimation | BridgeData tk_pnp In-Distribution V2 | VOC 0.038 | 7 |
| Expert vs. Non-Expert Trajectory Discrimination | BridgeData 5 scripted datasets V2 (in-distribution) | BinVOC 0.4 | 7 |
| Value function estimation | BridgeData Environment Shift V2 (ft_fold) | VOC 10.8 | 7 |
| Value function estimation | BridgeData Environment Shift V2 (rd_fold) | VOC 9.5 | 7 |
| Value function estimation | BridgeData ms_sweep Environment Shift V2 | VOC -0.129 | 7 |
| Value function estimation | BridgeData Embodiment Shift dt_tk_pnp V2 | VOC 0.042 | 7 |
Showing 10 of 15 rows
