VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization
About
Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in visual-language understanding, demonstrating superior high-level semantic alignment within their vision encoders. An important question thus arises: can these encoders also serve as versatile vision backbones that reliably perform classic vision-centric tasks? To address this question, we make the following contributions: (i) we identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation); (ii) we propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training, in which the vision backbone is optimized through lightweight task heads with multi-granularity supervision; (iii) extensive experiments across various downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language-mediated reasoning and pixel-level understanding.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 49.6 | 2731 |
| Semantic segmentation | PASCAL VOC (val) | mIoU | 86.6 | 338 |
| Text-to-Image Retrieval | COCO | Recall@1 | 54.8 | 130 |
| Image-to-Text Retrieval | COCO | R@1 | 69.5 | 123 |
| Text-to-Image Retrieval | Flickr | R@1 | 82.2 | 35 |
| Image Classification | ImageNet (val) | Accuracy (%) | 75.6 | 27 |
| Image-to-Text Retrieval | Flickr | R@1 | 92.5 | 25 |
| Monocular Depth Estimation | KITTI official (val) | RMSE | 3.136 | 23 |
| Referring Image Segmentation | RefCOCOg UMD (val) | mIoU | 72 | 17 |
| Referring Image Segmentation | RefCOCOg UMD (test) | mIoU | 74.3 | 16 |
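
For readers unfamiliar with the metrics in the table, the following is a minimal pure-Python sketch of how mIoU, R@1 (Recall@1), and RMSE are commonly defined. It is not the evaluation code behind these results. The function names are hypothetical, and the sketch makes simplifying assumptions: labels are given as flat lists, each retrieval query has exactly one correct candidate at the same index, and no ignore labels or depth caps are applied. Official benchmark protocols may differ on each of these points.

```python
import math


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target.

    pred and target are flat lists of integer class labels (assumed layout).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def recall_at_1(similarity):
    """Fraction of queries whose top-scoring candidate is the correct one.

    similarity[i][j] is the score of query i against candidate j; the
    correct candidate for query i is assumed to be index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(similarity)


def rmse(pred, target):
    """Root-mean-square error between two equal-length lists of numbers."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


if __name__ == "__main__":
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
    print(recall_at_1([[0.9, 0.1], [0.8, 0.2]]))
    print(rmse([1.0, 2.0], [1.0, 4.0]))
```

These functions return fractions in [0, 1] for mIoU and R@1, whereas the table appears to report them as percentages. Lower is better for RMSE, and higher is better for the other metrics.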