# Pelican-VL 1.0: A Foundation Brain Model for Embodied Intelligence
## About
This report presents Pelican-VL 1.0, a new family of open-source embodied brain models with parameter scales ranging from 7 billion to 72 billion. Our mission is to embed powerful intelligence into various embodiments. Pelican-VL 1.0 is currently the largest-scale open-source embodied multimodal brain model. Its core advantage lies in the deep integration of data power with an intelligent, adaptive learning mechanism. We train Pelican-VL 1.0 with a novel framework, DPPO (Deliberate Practice Policy Optimization), inspired by human metacognition. We operationalize DPPO as a metaloop, an RL-Refine-Diagnose-SFT cycle that teaches the model to practice deliberately. This metaloop distilled a high-quality dataset from a raw corpus of more than 4 billion tokens. Pelican-VL 1.0 was trained on a large-scale cluster of more than 1,000 A800 GPUs, consuming over 50,000 A800 GPU-hours per checkpoint. This yields a 20.3% performance uplift over its base model and outperforms 100B-parameter-level open-source counterparts by 10.6%, placing it on par with leading proprietary systems on well-known embodied benchmarks.
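The RL-Refine-Diagnose-SFT metaloop can be pictured as an iterative cycle: practice via RL, refine the successful rollouts into data, diagnose remaining weaknesses, then run targeted SFT on them. The toy sketch below illustrates only this control flow; every function name, the task representation, and the success criterion are illustrative assumptions, not the actual Pelican-VL training code:

```python
# Toy sketch of a Deliberate-Practice-style metaloop (RL -> Refine -> Diagnose -> SFT).
# All names here are hypothetical; the real system operates on model weights and
# multimodal rollouts, not sets of task labels.

def rl_practice(policy, tasks):
    """RL phase: the policy attempts each task; return (task, success) rollouts."""
    return [(t, policy(t)) for t in tasks]

def refine(rollouts):
    """Refine phase: keep only high-quality (successful) trajectories as training data."""
    return [t for t, ok in rollouts if ok]

def diagnose(rollouts):
    """Diagnose phase: collect the tasks the policy still fails, i.e. its weaknesses."""
    return [t for t, ok in rollouts if not ok]

def sft(skills, weaknesses):
    """SFT phase: fold diagnosed weak tasks back into the skill set (stand-in for fine-tuning)."""
    return skills | set(weaknesses)

def metaloop(tasks, iterations=3, practice_budget=2):
    """Run the RL-Refine-Diagnose-SFT cycle for a fixed number of iterations."""
    skills = set()
    policy = lambda t: t in skills          # toy policy: succeeds on tasks already learned
    for _ in range(iterations):
        rollouts = rl_practice(policy, tasks)
        curated = refine(rollouts)          # successes, kept as training data
        weak = diagnose(rollouts)           # failures, targeted next
        skills = sft(skills, weak[:practice_budget])  # deliberately practice a few weaknesses
    return skills
```

Each pass attacks only a small budget of diagnosed weaknesses, which is the "deliberate practice" intuition: effort is concentrated where the model currently fails rather than spread uniformly over all tasks.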
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Reasoning | BLINK | Accuracy | 56.8 | 76 |
| Spatial Reasoning | MindCube | Accuracy | 31 | 69 |
| Spatial Reasoning | EmbSpatial | Overall Accuracy | 73.2 | 63 |
| Spatial Reasoning | SITE | Accuracy | 52.3 | 39 |
| Embodied Task Completion | EB-Habitat | -- | -- | 32 |
| Embodied Reasoning and Question Answering | ERQA | Score | 39.8 | 30 |
| Embodied Question Answering | OpenEQA | Score | 63.3 | 21 |
| Visual Question Answering | AircopBench | Accuracy | 50.8 | 17 |
| Visual Spatial Intelligence | VSI | Accuracy | 52.8 | 17 |
| Spatial Aptitude | SAT | Accuracy | 67.3 | 17 |