SCAR: Self-Supervised Continuous Action Representation Learning

About

Despite the central role of action in embodied intelligence, learning transferable action representations from visual transitions remains a fundamental challenge, particularly when world models must generalize across embodiments under limited data. We argue that action is not merely an auxiliary conditioning signal, but a distinct representational factor that decouples controllable change from embodiment-specific actuation. In this work, we propose SCAR, a joint inverse-forward dynamics framework for learning unified action representations across embodiments from visual transitions. Built on a pretrained generative backbone, SCAR uses an inverse dynamics model (IDM) to infer latent actions from pairs of latent observations and a forward dynamics model (FDM) to predict future dynamics conditioned on those latent actions. To make the latent space transferable rather than a generic visual bottleneck, we regularize the latent action posterior toward a standard Gaussian prior to limit arbitrary visual encoding, and introduce adversarial invariance to suppress embodiment- and environment-specific nuisance factors. Experiments on the Procgen and Robotwin datasets show that the learned unified latent action representation serves as a stronger conditioning interface for world modeling than embodiment-specific raw actions, yielding improved cross-embodiment low-data adaptation and cross-task transfer. Taken together, these results suggest that action can be learned as a shared representation of controllable change across embodiments, providing an interface for more transferable and generalizable world models.

Hongjia Liu, Fan Feng, Minghao Fu, Xinyue Wang, Haofei Lu, Biwei Huang • 2026
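
The abstract's joint inverse-forward objective can be illustrated with a small sketch. This is not the authors' implementation: `toy_idm`, `toy_fdm`, `joint_loss`, and the weight `beta` are hypothetical stand-ins, the latents are plain lists of floats, and the adversarial invariance term is omitted. It only shows how a reconstruction term from the FDM might be combined with a KL penalty that pulls the IDM's latent action posterior toward a standard Gaussian prior.

```python
import math
import random


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def sample_latent_action(mu, logvar, rng):
    """Reparameterized sample: a = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def toy_idm(z_t, z_next):
    """Hypothetical IDM stand-in: posterior mean is the latent difference."""
    mu = [b - a for a, b in zip(z_t, z_next)]
    logvar = [-4.0] * len(mu)  # small fixed variance, chosen arbitrarily
    return mu, logvar


def toy_fdm(z_t, action):
    """Hypothetical FDM stand-in: adds the latent action to the current latent."""
    return [z + a for z, a in zip(z_t, action)]


def joint_loss(z_t, z_next, beta=0.1, rng=None):
    """Return (total, reconstruction, kl) for one latent transition.

    beta is an assumed trade-off weight, not a value from the paper.
    """
    rng = rng or random.Random(0)
    mu, logvar = toy_idm(z_t, z_next)              # infer latent action posterior
    action = sample_latent_action(mu, logvar, rng)
    pred = toy_fdm(z_t, action)                    # predict the next latent
    recon = sum((p - t) ** 2 for p, t in zip(pred, z_next)) / len(z_next)
    kl = kl_to_standard_normal(mu, logvar)         # pull posterior toward N(0, I)
    return recon + beta * kl, recon, kl


total, recon, kl = joint_loss([0.0, 1.0], [0.5, 0.5])
```

In this toy setting, raising `beta` penalizes posteriors that stray from the prior more heavily, which is the mechanism the abstract describes for limiting arbitrary visual encoding in the latent action.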

Related benchmarks

Task	Dataset	Result
Robotic cross-embodiment transfer	Robotwin place_a2b_left Target task m=10	SSIM 75.9	7
Robotic cross-embodiment transfer	Robotwin place_a2b_right Transfer task m=10	SSIM 77	7
Virtual-embodiment transfer	Procgen Group G1: caveflyer/chaser/ninja (test)	SSIM 59.4	7
Virtual-embodiment transfer	Procgen Group G2: heist/jumper/miner (test)	SSIM 0.563	7
World Model Prediction	Robotwin place_a2b_left (held-out)	SSIM 75.9	7
Cross-Embodiment Transfer	Robotwin aloha-agilex target embodiment	SSIM 79.6	3
Cross-task visual reconstruction	Robotwin aloha-agilex task shift	SSIM 79.5	3
