VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization

About

Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in visual-language understanding, with their vision encoders demonstrating superior high-level semantic alignment. An important question thus arises: can these encoders also serve as versatile vision backbones that reliably perform classic vision-centric tasks? To address this question, we make the following contributions: (i) we identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation); (ii) we propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training, in which the vision backbone is optimized via lightweight task heads with multi-granularity supervision; (iii) extensive experiments across various downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language-mediated reasoning and pixel-level understanding.

Yikun Liu, Yuan Liu, Shangzhe Di, Haicheng Wang, Zhongyin Zhao, Le Tian, Xiao Zhou, Jie Zhou, Jiangchao Yao, Yanfeng Wang, Weidi Xie • 2026

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 49.6	3069
Semantic segmentation	PASCAL VOC (val)	mIoU 86.6	380
Text-to-Image Retrieval	COCO	Recall@1 54.8	156
Image-to-Text Retrieval	COCO	R@1 69.5	152
Image-to-Text Retrieval	Flickr	R@1 92.5	45
Text-to-Image Retrieval	Flickr	R@1 82.2	40
Image Classification	ImageNet (val)	Accuracy (%) 75.6	27
Monocular Depth Estimation	KITTI official (val)	RMSE 3.136	23
Referring Image Segmentation	RefCOCOg UMD (val)	mIoU 72	17
Referring Image Segmentation	RefCOCOg UMD (test)	mIoU 74.3	16

Showing 10 of 22 rows
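
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how mIoU, RMSE, and Recall@1 are commonly defined. It is not the evaluation code behind the numbers above: the function names are hypothetical, the inputs are flat Python lists, and benchmark-specific details (ignored labels, dataset-level accumulation, depth caps, multiple captions per image) are not stated on this page and are left out. The retrieval function assumes exactly one correct candidate per query, at the same index as the query.

```python
import math


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union, averaged over classes that occur
    in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both pred and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def rmse(pred, target):
    """Root-mean-square error between predicted and true values."""
    squared = [(p - t) ** 2 for p, t in zip(pred, target)]
    return math.sqrt(sum(squared) / len(squared))


def recall_at_1(similarity):
    """Fraction of queries whose top-scoring candidate is the correct one.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(similarity)


if __name__ == "__main__":
    # Toy per-pixel labels for a 4-pixel image with 2 classes.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
    # Toy depth values.
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
    # Toy 3x3 query-candidate similarity matrix.
    print(recall_at_1([[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.7, 0.2, 0.3]]))
```

These functions return fractions in [0, 1] for mIoU and Recall@1, whereas the table appears to report them as percentages. For mIoU, Recall@1, and accuracy higher is better; for RMSE lower is better.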
