Learning Additively Compositional Latent Actions for Embodied AI
About
Latent action learning infers pseudo-action labels from visual transitions, offering a way to leverage internet-scale video for embodied AI. However, most methods learn latent actions without structural priors that encode the additive, compositional structure of physical motion. As a result, the latents often entangle irrelevant scene details or information about future observations with true state changes, and they miscalibrate motion magnitude. We introduce the Additively Compositional Latent Action Model (AC-LAM), which enforces scene-wise additive composition over short horizons in the latent action space. These AC constraints encourage simple algebraic structure in the latent action space (identity, inverse, cycle consistency) and suppress information that does not compose additively. Empirically, AC-LAM learns more structured, motion-specific, and displacement-calibrated latent actions and provides stronger supervision for downstream policy learning, outperforming state-of-the-art LAMs across simulated and real-world tabletop tasks.
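The additive-composition idea can be illustrated with simple penalty terms: a two-step latent should equal the sum of its one-step latents (cycle consistency), reversing a transition should negate its latent (inverse), and a no-motion transition should map to zero (identity). The sketch below is for illustration only; the function name, loss forms, and weighting are assumptions, not the paper's exact objective.

```python
import numpy as np

def ac_penalties(z_ab, z_bc, z_ac, z_ba, z_aa):
    """Illustrative additive-composition penalties on latent actions.

    z_xy denotes the latent action inferred for the transition x -> y.
    These squared-error forms are an assumed instantiation of the
    identity / inverse / cycle-consistency constraints.
    """
    # Cycle consistency: composing a->b and b->c should match the direct a->c latent.
    comp = np.sum((z_ab + z_bc - z_ac) ** 2)
    # Inverse: the reversed transition's latent should negate the forward one.
    inv = np.sum((z_ab + z_ba) ** 2)
    # Identity: a transition with no state change should map to the zero vector.
    ident = np.sum(z_aa ** 2)
    return comp + inv + ident

# Toy latents that satisfy all three constraints exactly, so the penalty is 0.
z_ab = np.array([1.0, -2.0])
z_bc = np.array([0.5, 0.5])
total = ac_penalties(z_ab, z_bc, z_ab + z_bc, -z_ab, np.zeros(2))
```

Latents that entangle scene details or future-observation information generally violate these algebraic relations, so minimizing such penalties pressures the model to keep only additively composing motion information.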
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Tabletop manipulation policy learning | Emoji Table-Top GrinningFace (ID) | Success Rate | 55 | 10 |
| Tabletop manipulation | Real-World Tabletop Manipulation (In-Distribution) | Success Rate | 60 | 5 |
| Tabletop manipulation | Real-World Tabletop Manipulation (Out-of-Distribution Distractors) | Success Rate | 53.3 | 5 |
| Tabletop manipulation | Real-World Tabletop Manipulation OOD-B (Out-of-Distribution Backgrounds) | Success Rate | 33.3 | 5 |
| Tabletop manipulation policy learning | Emoji Table-Top GrinningFace (train) | Success Rate | 42 | 5 |