Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

About

We present Pelican-Unify 1.0, the first embodied foundation model trained according to the principle of unification. Pelican-Unify 1.0 uses a single VLM as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also serves as a unified reasoning module, autoregressively producing task-, action-, and future-oriented chains of thought in a single forward pass and projecting the final hidden state into a dense latent variable. A Unified Future Generator (UFG) then conditions on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process. The language, video, and action losses are all backpropagated into the shared representation, enabling the model to jointly optimize understanding, reasoning, imagination, and action during training, rather than training three isolated expert systems. Experiments demonstrate that unification does not imply compromise. With a single checkpoint, Pelican-Unify 1.0 achieves strong performance across all three capabilities: 64.7 on eight VLM benchmarks, the best among comparable-scale models; 66.03 on WorldArena, ranking first; and 93.5 on RoboTwin, the second-best average among compared action methods. These results show that the unified paradigm succeeds in preserving specialist strength while bringing understanding, reasoning, imagination, and action into one model.
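The claim that the language, video, and action losses all reach the shared representation can be illustrated with a deliberately tiny scalar model. This is a toy sketch only, not the Pelican-Unify implementation: the class `ToyUnifiedModel`, its scalar weights, the squared-error losses, and the unweighted sum of losses are all assumptions made for illustration. It shows that when three heads read the same latent, the gradient on the shared weight is the sum of one contribution per loss.

```python
from dataclasses import dataclass

HEADS = ("lang", "video", "action")


@dataclass
class ToyUnifiedModel:
    """Hypothetical scalar stand-in for a shared backbone with three heads."""
    w_shared: float  # stands in for the shared VLM representation
    w_lang: float    # language (chain-of-thought) head
    w_video: float   # video head of the future generator
    w_action: float  # action head of the future generator

    def head_weights(self):
        return {"lang": self.w_lang, "video": self.w_video, "action": self.w_action}

    def forward(self, x):
        z = self.w_shared * x  # shared latent that every head conditions on
        return {name: w * z for name, w in self.head_weights().items()}


def joint_loss(model, x, targets):
    """Unweighted sum of squared errors over the three heads (an assumption)."""
    preds = model.forward(x)
    return sum((preds[k] - targets[k]) ** 2 for k in HEADS)


def shared_grad_per_loss(model, x, targets):
    """Analytic d(loss_k)/d(w_shared) for each head k, via the chain rule."""
    z = model.w_shared * x
    return {
        k: 2.0 * (w * z - targets[k]) * w * x
        for k, w in model.head_weights().items()
    }


if __name__ == "__main__":
    model = ToyUnifiedModel(w_shared=0.5, w_lang=1.0, w_video=2.0, w_action=-1.0)
    targets = {"lang": 0.0, "video": 1.0, "action": 1.0}
    grads = shared_grad_per_loss(model, 2.0, targets)
    print("per-loss gradient on shared weight:", grads)
    print("total gradient on shared weight:", sum(grads.values()))
```

Training three isolated experts would correspond to three separate copies of `w_shared`, each receiving only one of these gradient terms; in the joint setting a single shared weight receives all three.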

Yi Zhang, Yinda Chen, Che Liu, Zeyuan Ding, Jin Xu, Shilong Zou, Junwei Liao, Jiayu Hu, Xiancong Ren, Xiaopeng Zhang, Yechi Liu, Haoyuan Shi, Zecong Tang, Haosong Sun, Renwen Cui, Kuishu Wu, Wenhai Liu, Yang Xu, Yingji Zhang, Yidong Wang, Senkang Hu, Jinpeng Lu, Nga Teng Chan, Yechen Wu, Zeting Liu, Xianzhou Hou, Yong Dai, Jian Tang, Xiaozhu Ju • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multimodal Understanding | MMBench | -- | 847 |
| Multi-discipline Multimodal Understanding | MMMU | -- | 363 |
| Information Visual Question Answering | InfoVQA | Accuracy 78.4 | 110 |
| Visual Question Answering | ChartQA | Score 81.5 | 24 |
| Physical Reasoning | PhyX | -- | 24 |
| Multimodal Reasoning | MMStar | Score 63.3 | 18 |
| World Model Evaluation | World Arena Benchmark | EWM Score 66.03 | 15 |
| Robotic Manipulation | RoboTwin 50-task (Seen Tasks) | Clean Success Rate 93.6 | 14 |
| Human evaluation of robot rollout generation | WorldArena rollouts | Task Success 1.81 | 8 |
| Embodied Spatial Grounding | Where2Place | Score 45.2 | 4 |
Showing 10 of 11 rows
