
Learning Additively Compositional Latent Actions for Embodied AI

About

Latent action learning infers pseudo-action labels from visual transitions, providing a way to leverage internet-scale video for embodied AI. However, most methods learn latent actions without structural priors that encode the additive, compositional structure of physical motion. As a result, latents often entangle irrelevant scene details, or information about future observations, with true state changes, and miscalibrate motion magnitude. We introduce the Additively Compositional Latent Action Model (AC-LAM), which enforces scene-wise additive composition structure over short horizons on the latent action space. These AC constraints encourage simple algebraic structure in the latent action space (identity, inverse, cycle consistency) and suppress information that does not compose additively. Empirically, AC-LAM learns more structured, motion-specific, and displacement-calibrated latent actions and provides stronger supervision for downstream policy learning, outperforming state-of-the-art LAMs across simulated and real-world tabletop tasks.
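The algebraic structure the abstract names can be written as penalties on latent actions. As a minimal sketch (not the paper's implementation; all function and variable names here are hypothetical), suppose an encoder maps an observation pair to a latent action z(o_a, o_b). The identity, inverse, and additive-composition constraints then become squared-error terms:

```python
import numpy as np


def ac_losses(z_fwd, z_bwd, z_skip, z_self):
    """Sketch of additive-composition penalties on latent actions.

    Hypothetical inputs, for a trajectory of T transitions with d-dim latents:
      z_fwd  : (T, d)   latents for (o_t     -> o_{t+1})
      z_bwd  : (T, d)   latents for (o_{t+1} -> o_t)
      z_skip : (T-1, d) latents for (o_t     -> o_{t+2})
      z_self : (T, d)   latents for (o_t     -> o_t)
    """
    # Identity: a no-op transition should map to the zero latent action.
    identity = np.mean(np.sum(z_self ** 2, axis=-1))
    # Inverse / cycle consistency: going there and back should cancel,
    # i.e. z(o_t, o_{t+1}) + z(o_{t+1}, o_t) ~ 0.
    inverse = np.mean(np.sum((z_fwd + z_bwd) ** 2, axis=-1))
    # Additive composition over a 2-step horizon:
    # z(o_t, o_{t+2}) ~ z(o_t, o_{t+1}) + z(o_{t+1}, o_{t+2}).
    additive = np.mean(
        np.sum((z_skip - (z_fwd[:-1] + z_fwd[1:])) ** 2, axis=-1)
    )
    return identity, inverse, additive
```

Information that does not compose additively (e.g. static scene texture copied into every latent) cannot satisfy all three terms at once, which is one intuition for why such constraints suppress it.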

Hangxing Wei, Xiaoyu Chen, Chuheng Zhang, Tim Pearce, Jianyu Chen, Alex Lamb, Li Zhao, Jiang Bian • 2026

Related benchmarks

Task                                  | Dataset                                                                  | Result           | Rank
Tabletop Manipulation Policy Learning | Emoji Table-Top GrinningFace (ID)                                        | Success (S): 55  | 10
Tabletop manipulation                 | Real-World Tabletop Manipulation (In-Distribution)                       | Success Rate: 60 | 5
Tabletop manipulation                 | Real-World Tabletop Manipulation (Out-of-Distribution Distractors)       | Success Rate: 53.3 | 5
Tabletop manipulation                 | Real-World Tabletop Manipulation OOD-B (Out-of-Distribution Backgrounds) | Success Rate: 33.3 | 5
Tabletop Manipulation Policy Learning | Emoji Table-Top GrinningFace (train)                                     | Success Rate (S): 42 | 5
