
RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation

About

A fundamental objective in robot manipulation is to enable models to comprehend visual scenes and execute actions. Although existing Vision-Language-Action (VLA) models for robots can handle a range of basic tasks, they still face challenges in two areas: (1) insufficient reasoning ability to tackle complex tasks, and (2) high computational costs for VLA model fine-tuning and inference. The recently proposed state space model (SSM) known as Mamba demonstrates promising capabilities in non-trivial sequence modeling with linear inference complexity. Inspired by this, we introduce RoboMamba, an end-to-end robotic VLA model that leverages Mamba to deliver both robotic reasoning and action capabilities while maintaining efficient fine-tuning and inference. Specifically, we first integrate a vision encoder with Mamba, aligning visual tokens with language embeddings through co-training, which equips our model with visual common sense and robot-related reasoning. To further give RoboMamba SE(3) pose prediction abilities, we explore an efficient fine-tuning strategy with a simple policy head. We find that once RoboMamba possesses sufficient reasoning capability, it can acquire manipulation skills with minimal fine-tuning parameters (0.1% of the model) and time. In experiments, RoboMamba demonstrates outstanding reasoning capabilities on general and robotic evaluation benchmarks. Meanwhile, our model achieves impressive pose prediction results in both simulation and real-world experiments, with inference speeds 3 times faster than existing VLA models. Project web page: https://sites.google.com/view/robomamba-web
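The abstract's efficiency claim rests on the state space model recurrence underlying Mamba: each token updates a fixed-size hidden state, so inference cost grows linearly with sequence length rather than quadratically as in attention. The following is a deliberately minimal, non-selective toy sketch of that recurrence (the function name `ssm_scan` and the fixed scalar parameters `a`, `b`, `c` are illustrative assumptions, not from the paper; Mamba's actual selective scan makes these parameters input-dependent and vector-valued):

```python
def ssm_scan(xs, a=0.9, b=1.0, c=0.5):
    """Toy 1-D state space model: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    A single pass with constant work per token -> O(T) time and O(1)
    state, in contrast to the O(T^2) pairwise cost of self-attention.
    """
    h = 0.0
    ys = []
    for x in xs:           # one step per input token
        h = a * h + b * x  # state update: constant work, fixed-size state
        ys.append(c * h)   # readout from the current state
    return ys


# An impulse input decays geometrically through the state:
# ssm_scan([1.0, 0.0, 0.0]) -> [0.5, 0.45, 0.405]
```

This constant-memory, single-pass structure is what lets an SSM backbone keep per-token inference cost flat as robot observation and instruction sequences grow.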

Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, Shanghang Zhang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | VQA v2 | Accuracy | 79.6 | 1165 |
| Visual Question Answering | VizWiz | Accuracy | 58.1 | 1043 |
| Visual Question Answering | GQA | Accuracy | 64.4 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 87 | 935 |
| Multimodal Evaluation | MME | Score | 1340 | 557 |
| Visual Question Answering | OKVQA | Top-1 Accuracy | 63.3 | 283 |
| Multimodal Capability Evaluation | MM-Vet | Score | 31.4 | 282 |
| Multimodal Benchmarking | MMBench | Score | 65.7 | 62 |
| Multimodal Evaluation | MMB | Score | 60.9 | 27 |
| Embodied Question Answering | RoboVQA | BLEU-1 | 54.9 | 13 |
Showing 10 of 14 benchmark rows.
