SCAR: Self-Supervised Continuous Action Representation Learning
About
Despite the central role of action in embodied intelligence, learning transferable action representations from visual transitions remains a fundamental challenge, particularly when world models must generalize across embodiments under limited data. We argue that action is not merely an auxiliary conditioning signal, but a distinct representational factor that decouples controllable change from embodiment-specific actuation. In this work, we propose SCAR, a joint inverse-forward dynamics framework for learning unified action representations across embodiments from visual transitions. Built on a pretrained generative backbone, SCAR uses an inverse dynamics model (IDM) to infer latent actions from pairs of latent observations and a forward dynamics model (FDM) to predict future dynamics conditioned on those latent actions. To make the latent space transferable rather than a generic visual bottleneck, we regularize the latent action posterior toward a standard Gaussian prior, which limits arbitrary visual encoding, and introduce adversarial invariance to suppress embodiment- and environment-specific nuisance factors. Experiments on the Procgen and Robotwin datasets show that the learned unified latent action representation serves as a stronger conditioning interface for world modeling than embodiment-specific raw actions, yielding improved cross-embodiment low-data adaptation and cross-task transfer. Taken together, these results suggest that action can be learned as a shared representation of controllable change across embodiments, providing an interface for more transferable and generalizable world models.
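The regularization described above, which pulls the latent action posterior toward a standard Gaussian prior, can be illustrated with a minimal sketch. It assumes a diagonal-Gaussian posterior parameterized by a mean and log-variance, a mean-squared-error prediction term for the FDM, and a weighting coefficient `beta`; none of these choices is stated in the abstract, and the function names are hypothetical. The adversarial invariance term is omitted.

```python
import math
import random


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ) for one latent action.

    Assumption: the IDM outputs a diagonal Gaussian as (mu, logvar).
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def sample_latent_action(mu, logvar, rng):
    """Reparameterized sample a = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def scar_style_loss(z_next_pred, z_next, mu, logvar, beta=1e-2):
    """Hypothetical objective: FDM prediction error plus beta-weighted KL term.

    beta is an assumed hyperparameter, not a value taken from the paper.
    """
    return mse(z_next_pred, z_next) + beta * kl_to_standard_normal(mu, logvar)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, logvar = [0.5, -0.2], [0.0, -1.0]
    action = sample_latent_action(mu, logvar, rng)
    print("sampled latent action:", action)
    print("KL to prior:", kl_to_standard_normal(mu, logvar))
```

The KL term is zero only when the posterior equals the prior, so a larger `beta` restricts how much information about the visual transition the latent action can carry. This is one common way to implement the bottleneck the abstract describes.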
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Robotic cross-embodiment transfer | Robotwin place_a2b_left Target task m=10 | SSIM 75.9 | 7 |
| Robotic cross-embodiment transfer | Robotwin place_a2b_right Transfer task m=10 | SSIM 77 | 7 |
| Virtual-embodiment transfer | Procgen Group G1: caveflyer/chaser/ninja (test) | SSIM 59.4 | 7 |
| Virtual-embodiment transfer | Procgen Group G2: heist/jumper/miner (test) | SSIM 0.563 | 7 |
| World Model Prediction | Robotwin place_a2b_left (held-out) | SSIM 75.9 | 7 |
| Cross-Embodiment Transfer | Robotwin aloha-agilex target embodiment | SSIM 79.6 | 3 |
| Cross-task visual reconstruction | Robotwin aloha-agilex task shift | SSIM 79.5 | 3 |