
Holistic Evaluation of Multimodal LLMs on Spatial Intelligence

About

Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the physical world. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence (SI). We thus propose EASI for holistic Evaluation of multimodAl LLMs on Spatial Intelligence. EASI conceptualizes a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and a growing collection of newly curated ones, enabling systematic evaluation of state-of-the-art models. In this report, we conduct the study across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study reveals that (1) GPT-5 demonstrates unprecedented strength in SI, yet (2) still falls significantly short of human performance across a broad spectrum of SI tasks. Moreover, we show that (3) SI tasks expose greater deficiencies in model capability than non-SI tasks, to the extent that (4) proprietary models do not exhibit a decisive advantage on the most difficult ones. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans yet defeat the most advanced multimodal models. EASI is an ongoing community effort: we have open-sourced the EASI codebase, which provides a one-stop, reproducible solution with standardized interfaces and integrated protocols and prompts that significantly reduce the friction of configuring and running multiple benchmarks; we have also launched an accompanying EASI leaderboard to provide a continually updated snapshot of model performance across the full SI spectrum, accelerating collective progress toward robust SI.

Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Oscar Qian, Hui En Pang, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang (2025)

Related benchmarks

Task                  Dataset                   Metric          Result   Rank
Chart Understanding   ChartQA                   Accuracy        80.85    127
Vision Understanding  MMVP                      Accuracy        86.33    33
Visual Understanding  BLINK                     Accuracy        69.86    21
Visual Understanding  CV-Bench                  Accuracy        85.46    12
Visual Understanding  SAT                       Accuracy        73.3     11
Visual Understanding  VisPuzzle                 Accuracy        78       11
Visual Understanding  BLINK-J                   Accuracy        77.33    11
Visual Understanding  VSP                       Accuracy        57.33    11
Visual Understanding  VSTAR                     Accuracy        71.73    11
Spatial Reasoning     SpatialScore 1.0 (test)   Overall Score   30.62    10
