RoboBrain 2.0 Technical Report
About
We introduce RoboBrain 2.0, our latest generation of embodied vision-language foundation models, designed to unify perception, reasoning, and planning for complex embodied tasks in physical environments. It comes in two variants: a lightweight 7B model and a full-scale 32B model, each featuring a heterogeneous architecture that couples a vision encoder with a language model. Even at its compact sizes, RoboBrain 2.0 achieves strong performance across a wide spectrum of embodied reasoning tasks. On both spatial and temporal benchmarks, the 32B variant achieves leading results, surpassing prior open-source and proprietary models. In particular, it supports key real-world embodied AI capabilities, including spatial understanding (e.g., affordance prediction, spatial referring, trajectory forecasting) and temporal decision-making (e.g., closed-loop interaction, multi-agent long-horizon planning, and scene graph updating). This report details the model architecture, data construction, multi-stage training strategies, infrastructure, and practical applications. We hope RoboBrain 2.0 advances embodied AI research and serves as a practical step toward building generalist embodied agents. The code, checkpoints, and benchmarks are available at https://superrobobrain.github.io.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 88.1 | 1455 |
| Visual Question Answering | TextVQA | Accuracy | 81.0 | 1285 |
| Multimodal Evaluation | MME | Score | 2130 | 658 |
| Optical Character Recognition | OCRBench | Score | 857 | 232 |
| Visual Grounding | RefCOCO+ (val) | Accuracy | 70.1 | 212 |
| Visual Grounding | RefCOCO (val) | Accuracy | 76.1 | 147 |
| Visual Grounding | RefCOCOg (val) | Accuracy | 62.9 | 114 |
| Visual Question Answering | COCO | Score | 27.2 | 106 |
| Visual Reasoning | BLINK | Accuracy | 81.4 | 76 |
| Multimodal Reward Modeling | VL-RewardBench | Accuracy | 42.4 | 76 |