Vision-Language Models as a Source of Rewards
About
Building generalist agents that can accomplish many goals in rich open-ended environments is one of the research frontiers for reinforcement learning. A key limiting factor for building generalist agents with RL has been the need for a large number of reward functions for achieving different goals. We investigate the feasibility of using off-the-shelf vision-language models (VLMs) as sources of rewards for reinforcement learning agents. We show how rewards for visual achievement of a variety of language goals can be derived from the CLIP family of models and used to train RL agents that achieve those goals. We showcase this approach in two distinct visual domains and present a scaling trend showing that larger VLMs yield more accurate rewards for visual goal achievement, which in turn produces more capable RL agents.
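The core idea above — deriving a reward from a CLIP-style model by comparing an observation embedding to a language-goal embedding — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `embed` function below is a deterministic stand-in for a real CLIP image/text encoder, and the `threshold` value is an assumed hyperparameter.

```python
import hashlib
import numpy as np

def embed(text, dim=512):
    # Stand-in encoder: a deterministic unit vector per input string.
    # A real CLIP model maps images and text into a shared embedding
    # space; this placeholder only illustrates the reward computation.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

def clip_reward(obs_emb, goal_emb, threshold=0.5):
    # Binary goal-achievement reward: 1 if the cosine similarity
    # between the observation embedding and the language-goal
    # embedding exceeds the threshold, else 0.
    sim = float(obs_emb @ goal_emb)
    return 1.0 if sim > threshold else 0.0

goal = embed("chop the tree")
# An observation matching the goal gives similarity 1.0 -> reward 1.
assert clip_reward(embed("chop the tree"), goal) == 1.0
```

In practice the threshold trades off precision against recall of the reward detector; the paper's scaling trend suggests that larger VLMs make this detection more accurate.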
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Autonomous Driving | CARLA Town 2 (test) | AS | 0.53 | 15 |
| Autonomous Driving | CARLA Town 2 10 unseen (test) | AS | 0.06 | 12 |
| Autonomous Driving | CARLA Town 2 (train) | AS | 1.49 | 12 |
| Value function estimation | BridgeData dt_tk_stack Embodiment Shift V2 | VOC | 3.5 | 7 |
| Value function estimation | BridgeData tk_pnp In-Distribution V2 | VOC | 0.038 | 7 |
| Expert vs. Non-Expert Trajectory Discrimination | BridgeData 5 scripted datasets V2 (in-distribution) | BinVOC | 0.4 | 7 |
| Value function estimation | BridgeData Environment Shift V2 (ft_fold) | VOC | 10.8 | 7 |
| Value function estimation | BridgeData Environment Shift V2 (rd_fold) | VOC | 9.5 | 7 |
| Value function estimation | BridgeData ms_sweep Environment Shift V2 | VOC | -0.129 | 7 |
| Value function estimation | BridgeData Embodiment Shift dt_tk_pnp V2 | VOC | 0.042 | 7 |