
SCAR: Self-Supervised Continuous Action Representation Learning

About

Despite the central role of action in embodied intelligence, learning transferable action representations from visual transitions remains a fundamental challenge, particularly when world models must generalize across embodiments under limited data. We argue that action is not merely an auxiliary conditioning signal, but a distinct representational factor that decouples controllable change from embodiment-specific actuation. In this work, we propose SCAR, a joint inverse-forward dynamics framework for learning unified action representations across embodiments from visual transitions. Built on a pretrained generative backbone, SCAR uses an inverse dynamics model (IDM) to infer latent actions from latent observation pairs and a forward dynamics model (FDM) to predict future dynamics conditioned on them. To make the latent space transferable rather than a generic visual bottleneck, we regularize the latent action posterior toward a standard Gaussian prior to limit arbitrary visual encoding, and introduce adversarial invariance to suppress embodiment- and environment-specific nuisance factors. Experiments on the Procgen and Robotwin datasets show that the learned unified latent action representation serves as a stronger conditioning interface for world modeling than embodiment-specific raw actions, yielding improved cross-embodiment low-data adaptation and cross-task transfer. Taken together, these results suggest that action can be learned as a shared representation of controllable change across embodiments, providing an interface for more transferable and generalizable world models.
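The inverse-forward structure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the linear `LinearIDM` and `LinearFDM` classes, the `scar_style_loss` function, the squared-error prediction term, and the `beta` weight are all hypothetical stand-ins chosen for illustration, and the adversarial invariance term is omitted. Only the overall shape (IDM infers a Gaussian latent action from a latent observation pair, FDM predicts the next latent from it, and the posterior is pulled toward a standard Gaussian) follows the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

class LinearIDM:
    """Hypothetical inverse dynamics model: (z_t, z_next) -> Gaussian posterior over a latent action."""
    def __init__(self, obs_dim, act_dim):
        self.W_mu = rng.normal(scale=0.1, size=(2 * obs_dim, act_dim))
        self.W_lv = rng.normal(scale=0.1, size=(2 * obs_dim, act_dim))

    def __call__(self, z_t, z_next):
        pair = np.concatenate([z_t, z_next], axis=-1)
        return pair @ self.W_mu, pair @ self.W_lv  # posterior mean, log-variance

class LinearFDM:
    """Hypothetical forward dynamics model: (z_t, latent action) -> predicted z_next."""
    def __init__(self, obs_dim, act_dim):
        self.W = rng.normal(scale=0.1, size=(obs_dim + act_dim, obs_dim))

    def __call__(self, z_t, action):
        return np.concatenate([z_t, action], axis=-1) @ self.W

def scar_style_loss(idm, fdm, z_t, z_next, beta=1e-2):
    """Prediction error plus beta-weighted KL regularizer (adversarial term omitted)."""
    mu, log_var = idm(z_t, z_next)
    eps = rng.standard_normal(mu.shape)
    action = mu + np.exp(0.5 * log_var) * eps  # reparameterized latent action sample
    pred = fdm(z_t, action)
    recon = np.mean(np.sum((pred - z_next) ** 2, axis=-1))
    kl = np.mean(kl_to_standard_normal(mu, log_var))
    return recon + beta * kl, recon, kl

# Usage on random latent observation pairs (batch of 8, 16-dim latents, 4-dim actions).
idm, fdm = LinearIDM(16, 4), LinearFDM(16, 4)
z_t, z_next = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
total, recon, kl = scar_style_loss(idm, fdm, z_t, z_next)
```

The KL term is what keeps the latent action from acting as a generic visual bottleneck: a posterior that copies arbitrary detail from the observation pair drifts far from the standard Gaussian and is penalized.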

Hongjia Liu, Fan Feng, Minghao Fu, Xinyue Wang, Haofei Lu, Biwei Huang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Robotic cross-embodiment transfer | Robotwin place_a2b_left Target task m=10 | SSIM 75.9 | 7
Robotic cross-embodiment transfer | Robotwin place_a2b_right Transfer task m=10 | SSIM 77 | 7
Virtual-embodiment transfer | Procgen Group G1: caveflyer/chaser/ninja (test) | SSIM 59.4 | 7
Virtual-embodiment transfer | Procgen Group G2: heist/jumper/miner (test) | SSIM 0.563 | 7
World Model Prediction | Robotwin place_a2b_left (held-out) | SSIM 75.9 | 7
Cross-Embodiment Transfer | Robotwin aloha-agilex target embodiment | SSIM 79.6 | 3
Cross-task visual reconstruction | Robotwin aloha-agilex task shift | SSIM 79.5 | 3
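
All results above are reported as SSIM (structural similarity). As a rough illustration of what the metric measures, here is a simplified single-window version. The function name `global_ssim` is hypothetical, and the standard metric averages this quantity over local windows rather than computing it once over the whole image, so this sketch will not reproduce the numbers in the table. SSIM is at most 1, and values such as 75.9 presumably correspond to a score multiplied by 100; that reading is an assumption, not something the page states.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over whole images (a simplification of windowed SSIM)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Stabilizing constants from the usual SSIM definition (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x**2 + mu_y**2 + c1)
    contrast_structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return luminance * contrast_structure

# Usage: compare a predicted frame against a ground-truth frame with pixels in [0, 1].
rng = np.random.default_rng(0)
target = rng.random((32, 32))
prediction = np.clip(target + 0.05 * rng.standard_normal((32, 32)), 0.0, 1.0)
score = global_ssim(prediction, target)
```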