
VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

About

Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question. We introduce Value-Implicit Pre-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding, enabling the value function to be implicitly defined via the embedding distance, which can then be used to construct the reward for any goal-image specified downstream task. Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual reward for an extensive set of simulated and real-robot tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, few-shot offline RL on a suite of real-world robot tasks with as few as 20 trajectories.
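The abstract's reward construction can be made concrete: the value of an observation is implicitly V(o; g) = -||φ(o) - φ(g)||, the negative distance between the frozen embedding of the current observation and that of the goal image, and the dense per-step reward is the value difference between consecutive observations. A minimal sketch, assuming a placeholder `embed` function standing in for the frozen pre-trained VIP encoder (here a fixed random projection, for illustration only):

```python
import numpy as np

def embed(image: np.ndarray) -> np.ndarray:
    # Placeholder for the frozen VIP encoder phi(.).
    # A fixed, seeded random projection stands in so the sketch runs;
    # the real encoder is a pre-trained visual network.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((image.size, 32))
    return image.reshape(-1) @ proj

def vip_reward(obs: np.ndarray, next_obs: np.ndarray, goal: np.ndarray) -> float:
    """Dense reward as the change in negative embedding distance to the goal.

    With the implicit value V(o; g) = -||phi(o) - phi(g)||, the per-step
    reward is V(o'; g) - V(o; g): positive when the transition moves the
    observation closer to the goal image in embedding space.
    """
    g = embed(goal)
    d_now = float(np.linalg.norm(embed(obs) - g))
    d_next = float(np.linalg.norm(embed(next_obs) - g))
    return d_now - d_next
```

Because the reward is a difference of values, it acts like a potential-based shaping term: reaching the goal yields zero further reward, and any transition toward the goal (in embedding distance) is rewarded.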

Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, Amy Zhang • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Open Door | Meta-World | VOC Score | 32.38 | 35 |
| Reward Modeling | Meta-World Open Drawer | Prediction Accuracy | 65.58 | 28 |
| Open Drawer | Meta-World | VOC Score | 77.59 | 28 |
| Button Press | Meta-World | VOC Score | 77.94 | 28 |
| Reward Modeling | Meta-World Button Press | Prediction Accuracy | 57.83 | 28 |
| Reward Modeling | Meta-World Open Door | Prediction Accuracy | 60.12 | 28 |
| ObjectNav | Gibson (val) | Success Rate | 27.87 | 18 |
| Goal-conditioned Reinforcement Learning | manipulation-cube-single-play (test) | Success Rate | 0.4 | 11 |
| Goal-conditioned Reinforcement Learning | pointmaze navigate medium | Success Rate | 0.00 | 11 |
| Ordinal Consistency | In-the-wild 50 steps horizon v1 (test) | Kendall's Tau | 0.42 | 8 |
Showing 10 of 30 rows
