CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

About

Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA draws inspiration from human multimodal coordination and introduces a 3-stage progressive architecture. 1) Encoder-FiLM based Aggregation Routing (EFA-Routing) injects instruction information into the vision encoder to selectively aggregate and compress dual-stream visual tokens, forming an instruction-aware latent representation. 2) Building upon this compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing) introduces action intent into the language model by pruning instruction-irrelevant visually grounded tokens, thereby achieving token-level sparsity. 3) To ensure that compressed perception inputs can still support accurate and coherent action generation, we introduce V-L-A Coupled Attention (CAtten), which combines causal vision-language attention with bidirectional action parallel decoding. Extensive experiments on the LIBERO benchmark and real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA. CogVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/CogVLA.
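
The three stages described above can be illustrated with a toy sketch. This is not the CogVLA implementation: the function names, the top-k pruning rule, and the token layout (all vision-language tokens first, then all action tokens) are assumptions made for illustration. It shows only the generic building blocks the abstract names: FiLM-style feature modulation, score-based token pruning, and an attention mask that is causal over vision-language tokens and bidirectional over action tokens.

```python
from typing import List


def film(features: List[float], gamma: List[float], beta: List[float]) -> List[float]:
    """Feature-wise linear modulation: gamma * x + beta for each channel.

    In a real model gamma and beta would be predicted from the instruction
    embedding; here they are passed in directly (assumption for illustration).
    """
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def prune_tokens(scores: List[float], keep: int) -> List[int]:
    """Return the indices of the `keep` highest-scoring tokens, in original order.

    Top-k selection by a relevance score is a hypothetical stand-in for
    instruction-driven pruning, not the paper's actual routing rule.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def coupled_attention_mask(num_vl: int, num_action: int) -> List[List[bool]]:
    """Build a mask where mask[i][j] is True if query i may attend to key j.

    Assumed layout: the first `num_vl` positions are vision-language tokens,
    the remaining `num_action` positions are action tokens.
    - Vision-language queries attend causally (to themselves and earlier tokens).
    - Action queries attend to every vision-language token and to every
      action token, which allows the actions to be decoded in parallel.
    """
    total = num_vl + num_action
    return [
        [(j <= i) or (i >= num_vl) for j in range(total)]
        for i in range(total)
    ]


if __name__ == "__main__":
    print(film([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0]))
    print(prune_tokens([0.1, 0.9, 0.4, 0.7], keep=2))
    for row in coupled_attention_mask(num_vl=2, num_action=2):
        print(row)
```

In this sketch, pruning shortens the vision-language prefix before the mask is built, so fewer tokens reach attention; this is how token-level sparsity can reduce cost.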

Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Robot Manipulation	LIBERO	Object Achievement	98.8	957
Robotic Manipulation	LIBERO	Spatial Success Rate	98.6	527
Robot Manipulation	LIBERO (test)	Average Success Rate	97	220
Robot Manipulation	LIBERO simulation	Average Success Rate	97.4	73
Robotic Manipulation	LIBERO	Spatial Success Rate	98.6	13
Robotic Manipulation	LIBERO 40 (fine-tuning)	Spatial Success Rate	98.6	9
Robotic Manipulation	CSOT-Bench Ours (fine-tuning)	Scene Success Rate	81.2	8
Robotic Manipulation	Cobot Agilex ALOHA real-world	Object Placement (Cube->Plate) SR	90	6
Robot Manipulation	LIBERO simulation	Spatial SR	98.5	5
Robot Manipulation	Cobot Agilex ALOHA real-world	Red Cube -> Plate Success Rate	80	3
