
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

About

This paper introduces SAIL, a single-transformer unified multimodal large language model (MLLM) that integrates raw-pixel encoding and language decoding within a single architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, yielding a more minimalist architecture. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of the visual and textual modalities. We systematically compare SAIL's properties (scalability, cross-modal information flow patterns, and visual representation capabilities) with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, removing the pre-trained ViT components enhances SAIL's scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models are available at https://github.com/bytedance/SAIL.
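
The abstract names a mix-attention mechanism but does not define it. As a rough illustration only, the sketch below assumes it means that image-patch tokens of the same image attend to each other bidirectionally while text tokens attend causally. The function name `mixed_attention_mask` and the token layout are hypothetical and are not taken from the SAIL code base.

```python
def mixed_attention_mask(is_image):
    """Return mask[i][j] == True when query token i may attend to key token j.

    is_image: list of bools, one per token; True marks an image-patch token.
    Assumption (not from the source): each contiguous run of image tokens
    is treated as one image and gets full bidirectional attention.
    """
    n = len(is_image)
    # Give every contiguous run of image tokens its own block id; text stays None.
    block = [None] * n
    current = -1
    for i, img in enumerate(is_image):
        if img:
            if i == 0 or not is_image[i - 1]:
                current += 1
            block[i] = current

    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i  # standard left-to-right visibility
            same_image = block[i] is not None and block[i] == block[j]
            mask[i][j] = causal or same_image
    return mask


# Hypothetical sequence: two text tokens, three image patches, two text tokens.
tokens = [False, False, True, True, True, False, False]
mask = mixed_attention_mask(tokens)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In the printed grid, the rows for image patches can see later patches of the same image, while text rows remain strictly causal.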

Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, Zilong Huang • 2025

Related benchmarks

Task                                         Dataset    Result           Rank
Object Hallucination Evaluation              POPE       --               935
Text-based Visual Question Answering         TextVQA    Accuracy 77.1    496
Visual Question Answering                    AI2D       Accuracy 76.7    174
Optical Character Recognition Evaluation     OCRBench   Score 78.3       46
Multi-modal Vision-Language Understanding    MMVet      Score 46.3       38
General Vision-Language Understanding        MMB        Score 70.1       25
Hallucination Evaluation                     HallB      Score 54.2       19
Image-centric Multimodal Understanding       SEED-I     Score 72.9       16
High-quality Vision-Language Evaluation      MMStar     Score 53.1       14
