
ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models

About

Recent Vision-Language-Action (VLA) models have shown impressive flexibility and generalization, yet their deployment in robotic manipulation remains limited by heavy computational overhead and inference latency. In this work, we present ActDistill, a general action-guided self-derived distillation framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart. Unlike previous efficiency strategies that primarily emphasize vision-language correlations, ActDistill leverages action priors to guide knowledge transfer and model compression, achieving action-oriented efficiency for VLA models. Specifically, we employ a well-trained VLA model as the teacher and introduce a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction. The student model, derived from the graph-encapsulated teacher, is further equipped with a dynamic router that adaptively selects computation paths based on action prediction demands, guided by hierarchical graph-informed supervision to ensure smooth and efficient evolution. During inference, graph-related auxiliary components are removed, allowing the student to execute only dynamically routed layers and predict high-precision actions with minimal computation and latency. Experiments on embodied benchmarks demonstrate that ActDistill achieves comparable or superior performance to full-scale VLA models while reducing computation by over 50% with up to 1.67 times speedup, thereby establishing a general paradigm toward efficient embodied intelligence.

Wencheng Ye, Tianshi Wang, Lei Zhu, Fengling Li, Guoli Yang, Hengtao Shen • 2025
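
As a rough illustration of the dynamic routing idea in the abstract (the student executes only the layers a router selects), here is a minimal sketch. It is not the authors' implementation: the router in the paper is learned under graph-informed supervision, whereas this toy uses fixed hand-set weights and a hard threshold. All names (`make_layer`, `router_score`, `routed_forward`) and all numeric values are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float) -> Layer:
    """Hypothetical stand-in for a model layer: a simple residual update."""
    def layer(h: Vector) -> Vector:
        return [x + scale * x for x in h]
    return layer


def router_score(h: Vector, weights: Sequence[float]) -> float:
    """Hypothetical router: a linear score of the current hidden state."""
    return sum(w * x for w, x in zip(weights, h))


def routed_forward(
    h: Vector,
    layers: Sequence[Layer],
    router_weights: Sequence[Sequence[float]],
    threshold: float = 0.0,
) -> Tuple[Vector, List[int]]:
    """Run only the layers whose router score exceeds the threshold.

    Returns the final hidden state and the indices of executed layers.
    Skipped layers cost nothing beyond the router score itself.
    """
    executed: List[int] = []
    for idx, (layer, weights) in enumerate(zip(layers, router_weights)):
        if router_score(h, weights) > threshold:
            h = layer(h)
            executed.append(idx)
    return h, executed


# Toy setup (assumed values): three layers, one router weight vector per layer.
layers = [make_layer(0.5), make_layer(0.25), make_layer(1.0)]
router_weights = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
output, executed = routed_forward([1.0, 2.0], layers, router_weights)
print(output, executed)
```

In this toy run the second layer is skipped because its router score is negative, so only two of the three layers execute. A real system would train the router jointly with the student rather than fixing its weights by hand.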

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Robot Manipulation | SimplerEnv Google Robot tasks Variant Aggregation | Average Success Rate: 61.78 | 67 |
| Robot Manipulation | LIBERO | Spatial Success Rate: 81.8 | 30 |
| Robotic Manipulation | SIMPLER Visual Matching | Average Success Rate: 74.08 | 26 |
| Robotic Manipulation | ARX5 Real-World | Task 1 Success Rate: 80 | 3 |