ACT-JEPA: Novel Joint-Embedding Predictive Architecture for Efficient Policy Representation Learning

About

Learning efficient representations for decision-making policies is a challenge in imitation learning (IL). Current IL methods require expert demonstrations, which are expensive to collect. Additionally, they are not explicitly trained to understand the environment. Consequently, they have underdeveloped world models. Self-supervised learning (SSL) offers an alternative, as it can learn a world model from diverse, unlabeled data. However, most SSL methods are inefficient because they operate in raw input space. In this work, we propose ACT-JEPA, a novel architecture that unifies IL and SSL to enhance policy representations. It is trained end-to-end to jointly predict 1) action sequences and 2) latent observation sequences. To learn in latent space, we use a Joint-Embedding Predictive Architecture (JEPA), which allows the model to filter out irrelevant details and learn a robust world model. We evaluate ACT-JEPA in different environments and across multiple tasks. Our results show that it outperforms the strongest baseline in all environments. ACT-JEPA achieves up to 40% improvement in world model understanding and up to 10% higher task success rate. Finally, we show that predicting latent observation sequences effectively generalizes to predicting action sequences. This work demonstrates how integrating IL and SSL leads to efficient policy representation learning, an improved world model, and a higher task success rate.
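The joint objective described in the abstract (predicting action sequences and latent observation sequences together) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the use of mean absolute error for both terms, the `latent_weight` factor, and the helper names `mean_abs_error` and `joint_loss`. The target latents stand in for the output of some target encoder, which is not modeled.

```python
from typing import Sequence

Vector = Sequence[float]


def mean_abs_error(pred: Sequence[Vector], target: Sequence[Vector]) -> float:
    """Mean absolute error over two sequences of equal-length vectors."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    total, count = 0.0, 0
    for p, t in zip(pred, target):
        if len(p) != len(t):
            raise ValueError("vectors must have the same dimension")
        total += sum(abs(a - b) for a, b in zip(p, t))
        count += len(p)
    return total / count


def joint_loss(
    pred_actions: Sequence[Vector],
    true_actions: Sequence[Vector],
    pred_latents: Sequence[Vector],
    target_latents: Sequence[Vector],
    latent_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Sum of an action-sequence term and a weighted latent-observation term."""
    action_term = mean_abs_error(pred_actions, true_actions)
    # The latent term compares embeddings, not raw observations (assumed L1).
    latent_term = mean_abs_error(pred_latents, target_latents)
    return action_term + latent_weight * latent_term


# Toy data: two 2-D actions and one 2-D latent observation.
pred_actions = [[0.0, 1.0], [1.0, 1.0]]
true_actions = [[0.0, 0.0], [1.0, 1.0]]
pred_latents = [[0.5, 0.5]]
target_latents = [[0.0, 0.5]]

loss = joint_loss(pred_actions, true_actions, pred_latents, target_latents,
                  latent_weight=2.0)
print(loss)
```

In this toy case both terms equal 0.25, so doubling the latent weight gives a total of 0.75. Because the latent term is computed on embeddings, the model is never penalized for failing to reconstruct pixel-level detail, which matches the abstract's motivation for learning in latent space.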

Aleksandar Vujinovic, Aleksandar Kovacevic • 2025

Related benchmarks

Task	Dataset	Result
Policy Learning	Push-T (1 task)	Average Task Success Rate: 41	3
Policy Learning	ManiSkill (5 tasks)	Average Task Success Rate: 36	3
Policy Learning	Meta-World (15 tasks)	Average Task Success Rate: 92	3
