The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

About

This paper introduces SAIL, a single-transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a single architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, resulting in a more minimalist architectural design. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of the visual and textual modalities. We systematically compare SAIL's properties, including scalability, cross-modal information flow patterns, and visual representation capabilities, with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pre-trained ViT components enhances SAIL's scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models are available at https://github.com/bytedance/SAIL.
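The abstract names a mix-attention mechanism without spelling it out. As a rough illustration only, the sketch below builds one plausible mixed mask: text tokens attend causally, while tokens inside the same contiguous run of image patches also see each other bidirectionally. This rule, the function name `mixed_attention_mask`, and the `is_image` flag list are assumptions made for the example, not details taken from the paper or its repository.

```python
from typing import List


def mixed_attention_mask(is_image: List[bool]) -> List[List[bool]]:
    """Return mask[q][k] == True where query position q may attend to key k.

    Assumed rule (hypothetical, for illustration): causal attention
    everywhere, plus full bidirectional attention among tokens that belong
    to the same contiguous run of image patches.
    """
    n = len(is_image)

    # Give each contiguous run of image tokens its own block id; text gets -1.
    block_id = [-1] * n
    current = -1
    for i, flag in enumerate(is_image):
        if flag:
            if i == 0 or not is_image[i - 1]:
                current += 1
            block_id[i] = current

    # Start from a standard causal (lower-triangular) mask.
    mask = [[k <= q for k in range(n)] for q in range(n)]

    # Open up attention inside each image block in both directions.
    for q in range(n):
        for k in range(n):
            if block_id[q] != -1 and block_id[q] == block_id[k]:
                mask[q][k] = True
    return mask


if __name__ == "__main__":
    # One text token, two image patches, one text token.
    for row in mixed_attention_mask([False, True, True, False]):
        print("".join("x" if allowed else "." for allowed in row))
```

In this sketch, two images placed back to back with no text between them would be merged into one block; a real implementation would need explicit image boundaries.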

Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, Zilong Huang • 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Text-based Visual Question Answering	TextVQA	Accuracy 77.1	984
Optical Character Recognition	OCRBench	Score 78.3	486
Visual Question Answering	AI2D	Accuracy 76.7	402
Hallucination Evaluation	HallusionBench	Accuracy 54.2	153
Optical Character Recognition Evaluation	OCRBench	Score 78.3	91
Multi-modal Vision-Language Understanding	MMVet	Score 46.3	38
General Vision-Language Understanding	MMB	Score 70.1	25
Hallucination Evaluation	HallB	Score 54.2	19
Image-centric Multimodal Understanding	SEED-I	Score 72.9	16

Showing 10 of 11 rows
