
LAP: Language-Action Pre-Training Enables Zero-shot Cross-Embodiment Transfer

About

A long-standing goal in robotics is a generalist policy that can be deployed zero-shot on new robot embodiments without per-embodiment adaptation. Despite large-scale multi-embodiment pre-training, existing Vision-Language-Action models (VLAs) remain tightly coupled to their training embodiments and typically require costly fine-tuning. We introduce Language-Action Pre-training (LAP), a simple recipe that represents low-level robot actions directly in natural language, aligning action supervision with the pre-trained vision-language model's input-output distribution. LAP requires no learned tokenizer, no costly annotation, and no embodiment-specific architectural design. Based on LAP, we present LAP-3B, which to the best of our knowledge is the first VLA to achieve substantial zero-shot transfer to previously unseen robot embodiments without any embodiment-specific fine-tuning. Across multiple novel robots and manipulation tasks, LAP-3B attains an average zero-shot success rate above 50%, roughly a 2x improvement over the strongest prior VLAs. We further show that LAP enables efficient adaptation and favorable scaling, while unifying action prediction and visual question answering (VQA) in a shared language-action format that yields additional gains through co-training.
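To make the recipe concrete, below is a minimal Python sketch (illustrative, not the authors' implementation) of what a language-action interface can look like: a continuous end-effector action is rendered as an ordinary sentence, so a pre-trained VLM can be supervised on it with its standard next-token loss and the text is parsed back into numbers at execution time. The function names, field layout, units, and numeric precision are assumptions; the abstract does not specify LAP's exact action format.

```python
# Minimal sketch (illustrative, not the authors' code) of a language-action
# interface: a continuous low-level action is rendered as plain text so a
# pre-trained VLM can be trained on it with its ordinary next-token loss,
# with no learned action tokenizer. Field names, units, and rounding are
# assumptions for illustration.
import re
from typing import List, Tuple


def action_to_text(delta_pose: List[float], gripper: float) -> str:
    """Render a 6-DoF end-effector delta plus a gripper command as a sentence."""
    dx, dy, dz, droll, dpitch, dyaw = delta_pose
    grip = "close" if gripper > 0.5 else "open"
    return (
        f"move x {dx:+.3f} m, y {dy:+.3f} m, z {dz:+.3f} m; "
        f"rotate roll {droll:+.2f} rad, pitch {dpitch:+.2f} rad, yaw {dyaw:+.2f} rad; "
        f"gripper {grip}"
    )


def text_to_action(text: str) -> Tuple[List[float], float]:
    """Parse the sentence back into numeric commands for the controller."""
    numbers = [float(tok) for tok in re.findall(r"[-+]\d*\.?\d+", text)]
    gripper = 1.0 if "close" in text else 0.0
    return numbers[:6], gripper


if __name__ == "__main__":
    action = [0.012, -0.004, 0.030, 0.0, 0.10, -0.05]
    sentence = action_to_text(action, gripper=1.0)
    print(sentence)
    recovered, grip = text_to_action(sentence)
    # Round-trip is exact up to the chosen decimal precision.
    assert all(abs(a - b) <= 5e-3 for a, b in zip(action, recovered))
    assert grip == 1.0
```

Because the action lives in ordinary decimal text, supervision stays inside the VLM's native input-output distribution; the only approximation introduced is the rounding error bounded by the chosen precision.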

Lihan Zha, Asher J. Hancock, Mingtong Zhang, Tenny Yin, Yixuan Huang, Dhruv Shah, Allen Z. Ren, Anirudha Majumdar • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Robot Manipulation | LIBERO | Goal Achievement | 98.8 | 700 |
| Robot Picking | Pick MSProc sim | Success Rate | 12.6 | 11 |
| Robot Manipulation | Simulation held-out environments | Pick Success Rate (MSProc) | 12.6 | 9 |
| Action Prediction | Held-out Robot Embodiments (unseen) | Prediction Error | 15.1 | 3 |
