ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models

About

Vision-Language-Action (VLA) models are a powerful paradigm for generalist robotic control. However, their high computational cost and limited control frequency hinder real-time robotic manipulation, especially when large vision-language backbones and iterative action heads run at every control step. Existing VLA acceleration methods often optimize individual components or rely on fixed acceleration rules, applying largely the same computation to every control step and overlooking the non-uniform reasoning demands of sequential embodied control. Inspired by human motor control, where cognitive and feedback resources concentrate on goal-sensitive stages, we argue that VLA models should learn when to invest full computation and when to reuse prior computation. We propose ElegantVLA, a plug-in phase-adaptive inference framework that accelerates VLA models through intra-model dynamic compute scheduling. ElegantVLA introduces a lightweight scheduler that observes temporal representation similarity, robot-motion cues, and episode progress to jointly allocate computation across the vision encoder, LLM, and action head. For perception-language reasoning, the scheduler selects a five-level Vision-LLM compute mode, from full recomputation to multi-step temporal reuse, based on visual-language representation stability. For action generation, it selects a three-level denoising mode, reusing intermediate denoising states during stable motion while preserving full refinement for goal-sensitive stages. By coordinating these decisions, ElegantVLA offers a general acceleration framework for modern VLA pipelines with explicit action-generation modules, without modifying or retraining the base model. In experiments, ElegantVLA achieves up to 2.55x speedup on GR00T and up to 3.77x on CogACT. On six real-world GR00T tasks, it cuts computation by 2.18x while raising control frequency from 13.8 Hz to 26.3 Hz.
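
The two scheduling decisions described above can be illustrated with a minimal sketch. This is not the authors' scheduler: the abstract describes a lightweight learned module that also observes episode progress, whereas the code below substitutes hand-picked thresholds and uses only two cues. The names `schedule`, `ComputePlan`, `similarity`, and `motion_change`, and every threshold value, are assumptions made for illustration. Only the mode counts (five Vision-LLM levels, three denoising levels) come from the abstract.

```python
from bisect import bisect_right
from dataclasses import dataclass

# Hypothetical cut points, chosen only for illustration (not from the paper).
VLM_SIM_THRESHOLDS = (0.90, 0.95, 0.98, 0.99)  # 4 cut points -> 5 Vision-LLM modes
MOTION_THRESHOLDS = (0.05, 0.20)               # 2 cut points -> 3 denoising modes


@dataclass(frozen=True)
class ComputePlan:
    vlm_mode: int      # 0 = full recomputation ... 4 = longest temporal reuse
    denoise_mode: int  # 0 = full refinement ... 2 = most reuse of denoising states


def schedule(similarity: float, motion_change: float) -> ComputePlan:
    """Map two assumed cues to a compute plan with fixed thresholds.

    similarity: similarity between consecutive visual-language
        representations, assumed to lie in [0, 1].
    motion_change: normalized change in robot motion, assumed to lie in
        [0, 1]; small values stand in for "stable motion".
    """
    # More stable representations -> higher reuse level.
    vlm_mode = bisect_right(VLM_SIM_THRESHOLDS, similarity)
    # bisect_right counts thresholds <= motion_change; invert the count so
    # that a larger motion change selects a lower (fuller) denoising mode.
    denoise_mode = len(MOTION_THRESHOLDS) - bisect_right(
        MOTION_THRESHOLDS, motion_change
    )
    return ComputePlan(vlm_mode, denoise_mode)


if __name__ == "__main__":
    print(schedule(similarity=0.50, motion_change=0.50))   # unstable step
    print(schedule(similarity=0.995, motion_change=0.01))  # stable step
```

With these assumed thresholds, an unstable step falls back to full computation in both modules, while a highly stable step selects the highest reuse level in both.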

Ye Li, Huanan Liu, Kangye Ji, Yuan Meng, Jiajun Fan, Yuansong Wang, Shiyu Qin, Chenglei Wu, Shu-Tao Xia, Zhi Wang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Robot Manipulation | SimplerEnv Google Robot tasks Variant Aggregation | Average Success Rate | 72.54 | 88 |
| Robot Manipulation Simulation | SimplerEnv Google Robot GR00T (simulation) | Close Success Rate | 81 | 4 |
| Robot Manipulation Simulation | SimplerEnv WidowX GR00T (simulation) | Success Rate (Carrot) | 63.5 | 4 |
| Pineapple Bun (Conveyor-belt pickup) | Franka Research Real-world 3 | Success Rate | 80 | 2 |
| Toast (Conveyor-belt pickup) | Franka Research Real-world 3 | Success Rate | 80 | 2 |
| Chocolate (Conveyor-belt pickup) | Franka Research Real-world 3 | Success Rate | 60 | 2 |
| Pen Holder | Franka Research 3 Real-world | Success Rate | 70 | 2 |
| Phone Stand | Franka Research 3 Real-world | Success Rate | 60 | 2 |
| Stack bowls | Franka Research Real-world 3 | Success Rate | 40 | 2 |