LAP: Language-Action Pre-Training Enables Zero-shot Cross-Embodiment Transfer

About

A long-standing goal in robotics is a generalist policy that can be deployed zero-shot on new robot embodiments without per-embodiment adaptation. Despite large-scale multi-embodiment pre-training, existing Vision-Language-Action models (VLAs) remain tightly coupled to their training embodiments and typically require costly fine-tuning. We introduce Language-Action Pre-training (LAP), a simple recipe that represents low-level robot actions directly in natural language, aligning action supervision with the pre-trained vision-language model's input-output distribution. LAP requires no learned tokenizer, no costly annotation, and no embodiment-specific architectural design. Based on LAP, we present LAP-3B, which to the best of our knowledge is the first VLA to achieve substantial zero-shot transfer to previously unseen robot embodiments without any embodiment-specific fine-tuning. Across multiple novel robots and manipulation tasks, LAP-3B attains over 50% average zero-shot success, delivering roughly a 2x improvement over the strongest prior VLAs. We further show that LAP enables efficient adaptation and favorable scaling, while unifying action prediction and VQA in a shared language-action format that yields additional gains through co-training.
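To make "represents low-level robot actions directly in natural language" concrete, here is a minimal sketch of how an end-effector command might be rendered as text. The abstract does not give LAP's actual phrasing, units, or action space, so the function name `action_to_text`, the axis names, the sign convention, and the centimetre formatting are all illustrative assumptions rather than the authors' format.

```python
from typing import Sequence

# Hypothetical axis naming and sign convention (+x forward, +y left, +z up).
# The source does not specify how LAP actually phrases actions.
_AXES = (("forward", "backward"), ("left", "right"), ("up", "down"))


def action_to_text(delta_xyz_m: Sequence[float], gripper_open: bool,
                   min_cm: float = 0.1) -> str:
    """Render a translation delta (metres) and gripper state as plain English."""
    parts = []
    for value, (pos_name, neg_name) in zip(delta_xyz_m, _AXES):
        cm = round(value * 100.0, 1)  # metres -> centimetres, one decimal place
        if abs(cm) < min_cm:
            continue  # skip axes with negligible motion
        direction = pos_name if cm > 0 else neg_name
        parts.append(f"move {direction} {abs(cm):.1f} cm")
    parts.append("open gripper" if gripper_open else "close gripper")
    return ", ".join(parts)


print(action_to_text([0.03, -0.012, 0.0], gripper_open=True))
```

Under these assumptions, the target is an ordinary string that a pre-trained vision-language model can emit with its existing vocabulary, which is the property the abstract credits for avoiding a learned action tokenizer.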

Lihan Zha, Asher J. Hancock, Mingtong Zhang, Tenny Yin, Yixuan Huang, Dhruv Shah, Allen Z. Ren, Anirudha Majumdar • 2026

Related benchmarks

Task	Dataset	Metric	Result
Robot Manipulation	LIBERO	Object Achievement	99	957
Robot Manipulation	Simulation held-out environments	Pick Success Rate (MSProc)	19.4	14
Robot Picking	Pick MSProc sim	Success Rate	12.6	11
Pick	MolmoSpace	Success Rate	24.9	5
Open	MolmoSpace	Success Rate	11.4	4
Close	MolmoSpace	Success Rate	45.9	4
Pick-&-Place	MolmoSpace	Success Rate	6.6	4
Action Prediction	Held-out Robot Embodiments (unseen)	Prediction Error	15.1	3
