# LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
## About
Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policies as visuo-textual conversations and enables efficient transfer of a pretrained VLM into a powerful VLA, motivated by the success of visual instruction tuning in computer vision. First, we present an automated pipeline that generates conversation-style instruction tuning data for robots from existing behavior cloning datasets, aligning robotic actions with image pixel coordinates. We then enhance this dataset in a self-supervised manner by defining six auxiliary tasks that require no additional action annotations. We show that a VLM finetuned on a limited amount of such data can produce meaningful action decisions for robotic control. Through experiments across multiple simulated and real-world tasks, we demonstrate that LLaRA achieves state-of-the-art performance while preserving the generalization capabilities of large language models. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.
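As a concrete illustration of the data pipeline described above, the sketch below converts a single behavior-cloning step into a LLaVA-style conversation sample, expressing a pick-and-place action as image pixel coordinates via camera projection. The function names, prompt template, and JSON layout here are illustrative assumptions, not LLaRA's exact format; see the repository for the real pipeline.

```python
# Minimal sketch (not the official pipeline): turn one behavior-cloning
# (observation, action) step into a visuo-textual conversation sample,
# with the action expressed as image pixel coordinates.
import numpy as np

def project_to_pixels(point_xyz, intrinsics, extrinsics):
    """Project a 3D world point onto the image plane (pinhole camera).

    intrinsics: 3x3 camera matrix; extrinsics: 4x4 world-to-camera transform.
    """
    p_cam = extrinsics @ np.append(point_xyz, 1.0)   # world -> camera frame
    u, v, w = intrinsics @ p_cam[:3]                 # camera -> homogeneous pixels
    return int(round(u / w)), int(round(v / w))

def bc_step_to_conversation(instruction, image, pick_xyz, place_xyz,
                            intrinsics, extrinsics):
    """Format one demonstration step as an instruction-tuning conversation."""
    pick_uv = project_to_pixels(pick_xyz, intrinsics, extrinsics)
    place_uv = project_to_pixels(place_xyz, intrinsics, extrinsics)
    return {
        "image": image,  # raw observation passed to the VLM
        "conversations": [
            {"from": "human",
             "value": f"<image>\nWhat action should the robot take to "
                      f"{instruction}?"},
            {"from": "gpt",
             "value": f"Pick at {pick_uv}, then place at {place_uv}."},
        ],
    }
```

Under these assumptions, each demonstration step yields one such sample; the self-supervised auxiliary samples are generated from the same observations and appended to the dataset without any extra action labels.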
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot manipulation generalization | VIMA-Bench | Novel Task | 33.8 | 5 |
| Cluttered Localization (T3) | Real-world robot experiments | Success Rate | 45 | 4 |
| Color Match (T2) | Real-world robot experiments | Success Rate | 30 | 4 |
| Relative Height (T4) | Real-world robot experiments | Success Rate | 5.00e+3 | 4 |
| Target Object (T1) | Real-world robot experiments | Success Rate | 5.00e+3 | 4 |