Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
About
Despite the significant advances represented by Vision-Language Models (VLMs), current architectures often fail to retain fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm in prevailing VLMs: a text-dominant optimization bias that treats visual signals merely as passive conditional inputs rather than as supervisory targets. To mitigate this, we introduce Youtu-VL, a framework built on the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. We further extend this paradigm to vision-centric tasks such as detection and segmentation, enabling a standard VLM to perform them without task-specific architectural additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
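To make the "vision-as-target" objective concrete, below is a minimal sketch of unified autoregressive supervision over an interleaved vision-and-text sequence. It assumes a discrete visual tokenizer (e.g., a VQ-style codebook) whose indices are offset into a shared prediction space with the text vocabulary; the vocabulary sizes, the toy decoder, and the image-then-caption ordering are illustrative assumptions, not Youtu-VL's released implementation.

```python
# Sketch of "vision-as-target": discrete visual tokens join the text tokens
# in one sequence and receive the same next-token cross-entropy, instead of
# serving only as conditioning. All names and sizes here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, VISION_VOCAB = 32000, 8192     # assumed vocabulary sizes
UNIFIED_VOCAB = TEXT_VOCAB + VISION_VOCAB  # shared prediction space

class UnifiedDecoder(nn.Module):
    """A toy causal decoder over the joint text+vision vocabulary."""
    def __init__(self, dim=256, layers=2, heads=4):
        super().__init__()
        self.embed = nn.Embedding(UNIFIED_VOCAB, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, UNIFIED_VOCAB)

    def forward(self, tokens):
        T = tokens.size(1)
        # Causal mask: each position may only attend to earlier positions.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), 1)
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)

def unified_loss(model, text_ids, vision_ids):
    # Offset vision indices so both modalities share one vocabulary,
    # then concatenate (here: image tokens first, caption after).
    seq = torch.cat([vision_ids + TEXT_VOCAB, text_ids], dim=1)
    logits = model(seq[:, :-1])
    # A single autoregressive cross-entropy supervises BOTH modalities:
    # visual tokens are prediction targets, not just passive inputs.
    return F.cross_entropy(logits.reshape(-1, UNIFIED_VOCAB),
                           seq[:, 1:].reshape(-1))

model = UnifiedDecoder()
text = torch.randint(0, TEXT_VOCAB, (2, 16))      # dummy caption tokens
vision = torch.randint(0, VISION_VOCAB, (2, 64))  # dummy VQ image tokens
print(unified_loss(model, text, vision))
```

The key design point the sketch isolates is that the loss is computed over the whole interleaved sequence, so gradients flow from visual-token prediction errors as well as from text, rather than from text alone.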
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 48.0 | 2454 |
| Semantic Segmentation | ADE20K | mIoU | 54.2 | 936 |
| Object Detection | COCO (val) | mAP | 47.1 | 613 |
| Semantic Segmentation | Cityscapes | mIoU | 70.4 | 578 |
| Semantic Segmentation | COCO Stuff | mIoU | 52.2 | 195 |
| Image Classification | ImageNet-ReaL | Precision@1 | 89.3 | 195 |
| Depth Estimation | NYU Depth V2 | -- | -- | 177 |
| Visual Grounding | RefCOCO+ (val) | Accuracy | 90.1 | 171 |
| Semantic Segmentation | Pascal Context 59 | mIoU | 60.4 | 164 |
| Visual Grounding | RefCOCO (testB) | Accuracy | 90.8 | 125 |