
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

About

Multimodal large language models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of a generalist robotic MLLM that understands complex human instructions and accomplishes various embodied tasks. However, developing MLLMs for real-world robots is challenging because robotic platforms typically offer limited computation and memory, whereas MLLM inference requires storing billions of parameters and performing substantial computation, imposing significant hardware demands. In this paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Models (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM to the situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once an appropriately sized sub-model has been activated for a given situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), peak computational cost (i.e., latency), and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR reduces the LLM's computational cost by 5.2-6.5× and its GPU memory usage by 2-6× without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.
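The core mechanism described above, a multi-exit backbone that stops activating further layers once an early exit suffices, can be sketched as follows. This is a minimal illustration, not DeeR's actual implementation: the module names (`blocks`, `exit_heads`), the tiny dimensions, and the consecutive-exit agreement threshold are all assumptions made for the sketch; DeeR's real termination criteria are derived from the resource demands discussed in the abstract.

```python
import torch
import torch.nn as nn

class MultiExitPolicy(nn.Module):
    """Toy multi-exit backbone with dynamic early termination.

    Hypothetical sketch: names, sizes, and the agreement-based exit
    rule are illustrative, not DeeR-VLA's actual design.
    """

    def __init__(self, dim=32, num_blocks=4, action_dim=7, threshold=0.05):
        super().__init__()
        # each "block" stands in for a group of LLM layers
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU())
            for _ in range(num_blocks)
        )
        # one lightweight action head per exit point
        self.exit_heads = nn.ModuleList(
            nn.Linear(dim, action_dim) for _ in range(num_blocks)
        )
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):
        prev_action = None
        for i, (block, head) in enumerate(zip(self.blocks, self.exit_heads)):
            x = block(x)        # activate one more stage of the model
            action = head(x)    # intermediate action prediction
            # Terminate once consecutive exits agree: extra depth is
            # then unlikely to change the action, so further compute
            # would be redundant.
            if prev_action is not None and (action - prev_action).norm() < self.threshold:
                return action, i + 1   # action and number of blocks used
            prev_action = action
        return prev_action, len(self.blocks)

policy = MultiExitPolicy()
obs = torch.randn(1, 32)          # stand-in for fused vision-language features
action, depth_used = policy(obs)  # depth_used varies per input
```

Easy situations trigger the agreement criterion after few blocks, while harder ones run the full stack, which is what yields the input-dependent average compute savings the abstract reports.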

Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Robotic Manipulation | LIBERO Spatial | Success Rate: 97 | 314 |
| Long-horizon robot manipulation | CALVIN ABCD→D | Task 1 Completion Rate: 98.2 | 127 |
| Robotic Manipulation | CALVIN ABCD→D | Avg Length: 4.13 | 89 |
| Long-horizon task completion | CALVIN ABC→D | Success Rate (1): 86.2 | 67 |
| Robot Manipulation | CALVIN ABC→D | Average Successful Length: 2.9 | 48 |
| Sequential Robotic Manipulation | CALVIN | Success Rate (1 task): 84.8 | 45 |
| Robotic Manipulation | CALVIN D→D | -- | 40 |
| Long-horizon task success | CALVIN D→D long-horizon | Success Rate (LH-1): 99.1 | 11 |
| Object Goal Navigation | MP3D (val) | SR: 81 | 11 |
| Robot Manipulation | CALVIN D→D | Average Length: 2.83 | 7 |

Showing 10 of 12 rows.

Other info

Code: https://github.com/yueyang130/DeeR-VLA
