Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning

About

Learning effective visual representations for robotic manipulation remains a fundamental challenge due to the complex body dynamics involved in action execution. In this paper, we study how visual representations that carry body-relevant cues can enable efficient policy learning for downstream robotic manipulation tasks. We present Inter-token Contrast (ICon), a contrastive learning method applied to the token-level representations of Vision Transformers (ViTs). ICon enforces a separation in the feature space between agent-specific and environment-specific tokens, resulting in agent-centric visual representations that embed body-specific inductive biases. This framework can be seamlessly integrated into end-to-end policy learning by incorporating the contrastive loss as an auxiliary objective. Our experiments show that ICon not only improves policy performance across various manipulation tasks but also facilitates policy transfer across different robots. Project website: https://inter-token-contrast.github.io/icon/
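To make the idea of separating agent and environment tokens concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact form of the ICon loss, so the prototype-based InfoNCE-style objective, the function names (inter_token_contrast, total_loss), the temperature, and the auxiliary weight are all assumptions chosen for illustration. The sketch also assumes a boolean mask is available that marks which ViT tokens cover the robot body.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def inter_token_contrast(tokens, agent_mask, temperature=0.1):
    """Hypothetical token-level contrastive loss (illustrative only).

    tokens:     (N, D) array of ViT token features for one image.
    agent_mask: (N,) boolean array, True where a token covers the robot body
                (assumed to be given, e.g. from a segmentation mask).
    Each token is pulled toward the mean feature (prototype) of its own group
    and pushed away from the prototype of the other group.
    """
    agent_mask = np.asarray(agent_mask, dtype=bool)
    if agent_mask.all() or not agent_mask.any():
        raise ValueError("need at least one agent and one environment token")

    z = l2_normalize(np.asarray(tokens, dtype=np.float64))
    agent_proto = l2_normalize(z[agent_mask].mean(axis=0))
    env_proto = l2_normalize(z[~agent_mask].mean(axis=0))

    # Temperature-scaled cosine similarity of every token to each prototype.
    sim_agent = z @ agent_proto / temperature
    sim_env = z @ env_proto / temperature

    # Positive = own-group prototype, negative = other-group prototype.
    pos = np.where(agent_mask, sim_agent, sim_env)
    neg = np.where(agent_mask, sim_env, sim_agent)

    # -log(exp(pos) / (exp(pos) + exp(neg))), computed stably.
    return float(np.mean(np.logaddexp(pos, neg) - pos))


def total_loss(policy_loss, contrast_loss, weight=0.5):
    """Add the contrastive term as an auxiliary objective (weight is assumed)."""
    return policy_loss + weight * contrast_loss
```

In this sketch the loss approaches zero when the two token groups point in clearly different directions, and equals log 2 when they are indistinguishable, so minimizing it alongside the policy loss encourages the separation the abstract describes.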

Junlin Wang, Zhiyun Lin • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Open Box	RLBench	Success Rate	30	13
Close Drawer	RLBench	Success Rate	91.3	5
Close Microwave	RLBench	Success Rate	100	5
Put Rubbish in Bin	RLBench	Success Rate	9.3	5
Take Lid off Saucepan	RLBench	Success Rate	41.3	5
Lift	Robosuite Franka (Default Gripper) few-shot transfer	Success Rate	62.7	2
Lift	Robosuite Kinova (Robotiq85) few-shot transfer	Success Rate	26	2
Lift	Robosuite Target Robot: IIWA (Robotiq140) few-shot transfer	Success Rate	10	2
Stack	Robosuite Franka (Default Gripper) few-shot transfer	Success Rate	22	2
Stack	Robosuite Kinova (Robotiq85) few-shot transfer	Success Rate	5.3	2

Showing 10 of 11 rows
