Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning

About

Learning effective visual representations for robotic manipulation remains a fundamental challenge due to the complex body dynamics involved in action execution. In this paper, we study how visual representations that carry body-relevant cues can enable efficient policy learning for downstream robotic manipulation tasks. We present Inter-token Contrast (ICon), a contrastive learning method applied to the token-level representations of Vision Transformers (ViTs). ICon enforces a separation in the feature space between agent-specific and environment-specific tokens, resulting in agent-centric visual representations that embed body-specific inductive biases. This framework can be seamlessly integrated into end-to-end policy learning by incorporating the contrastive loss as an auxiliary objective. Our experiments show that ICon not only improves policy performance across various manipulation tasks but also facilitates policy transfer across different robots. Project website: https://inter-token-contrast.github.io/icon/
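
The abstract does not give the exact form of the contrastive loss, so the following is only a minimal sketch of the general idea, under stated assumptions: each ViT token is a plain feature vector, a boolean mask marks which patches show the robot, and the loss pulls every token toward the mean ("prototype") of its own group and away from the other group's prototype. The names `inter_token_contrast`, `total_loss`, `agent_mask`, `temperature`, and `weight` are hypothetical and are not taken from the paper or its code.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def _mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))


def inter_token_contrast(tokens, agent_mask, temperature=0.1):
    """Hypothetical prototype-based contrastive loss over ViT tokens.

    tokens:     list of N feature vectors, one per image patch
    agent_mask: list of N booleans, True where the patch shows the robot
    """
    tokens = [_normalize(t) for t in tokens]
    agent = [t for t, m in zip(tokens, agent_mask) if m]
    env = [t for t, m in zip(tokens, agent_mask) if not m]
    if not agent or not env:
        return 0.0  # only one group present, nothing to contrast

    agent_proto = _normalize(_mean(agent))
    env_proto = _normalize(_mean(env))

    total = 0.0
    for t, is_agent in zip(tokens, agent_mask):
        pos, neg = (agent_proto, env_proto) if is_agent else (env_proto, agent_proto)
        s_pos = _dot(t, pos) / temperature
        s_neg = _dot(t, neg) / temperature
        # Equals -log(exp(s_pos) / (exp(s_pos) + exp(s_neg))).
        total += _softplus(s_neg - s_pos)
    return total / len(tokens)


def total_loss(policy_loss, tokens, agent_mask, weight=1.0):
    """Add the contrastive term to a policy loss as an auxiliary objective."""
    return policy_loss + weight * inter_token_contrast(tokens, agent_mask)
```

In this sketch the loss approaches zero when agent and environment tokens occupy distinct directions in feature space, and equals log 2 when the two groups are indistinguishable. The `weight` argument stands in for whatever coefficient balances the auxiliary term against the policy objective.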

Junlin Wang, Zhiyun Lin • 2025

Related benchmarks

Task                   Dataset                                                      Metric        Result  Rank
Open Box               RLBench                                                      Success Rate  30      10
Close Drawer           RLBench                                                      Success Rate  91.3    5
Close Microwave        RLBench                                                      Success Rate  100     5
Put Rubbish in Bin     RLBench                                                      Success Rate  9.3     5
Take Lid off Saucepan  RLBench                                                      Success Rate  41.3    5
Lift                   Robosuite Franka (Default Gripper) few-shot transfer         Success Rate  62.7    2
Lift                   Robosuite Kinova (Robotiq85) few-shot transfer               Success Rate  26      2
Lift                   Robosuite Target Robot: IIWA (Robotiq140) few-shot transfer  Success Rate  10      2
Stack                  Robosuite Franka (Default Gripper) few-shot transfer         Success Rate  22      2
Stack                  Robosuite Kinova (Robotiq85) few-shot transfer               Success Rate  5.3     2
Showing 10 of 11 rows
