
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision

About

Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by treating visual signals merely as passive conditional inputs rather than as supervisory targets. To mitigate this, we introduce Youtu-VL, a framework built on the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to vision-centric tasks, enabling a standard VLM to perform them without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.

Zhixiang Wei, Yi Li, Zhehan Kan, Xinghua Jiang, Zuwei Long, Shifeng Liu, Hongze Shen, Wei Liu, Xiaoyu Tan, Haojia Lin, Yubo Zhu, Qianyu Li, Di Yin, Haoyu Cao, Weibo Gu, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Mingkong Tang, Shuangyin Liu, Lexiang Tang, Haodong Lin, Junru Lu, Jiarui Qin, Lingfeng Qiao, Ruizhi Qiao, Bo Ke, Jianfeng He, Ke Li, Yangning Li, Yunhang Shen, Mengdan Zhang, Peixian Chen, Kun Yin, Bing Liu, Yunfei Wu, Huang Chen, Zhongpeng Cai, Xiaotian Li• 2026
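The page does not include any code, so as a rough, hypothetical illustration of the "vision-as-target" idea described in the abstract, the sketch below contrasts a conventional text-only autoregressive loss with one that also supervises visual token positions in the same interleaved sequence. The function name, vocabulary layout (text ids plus quantized visual codebook ids in one unified vocabulary), and all shapes are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def unified_autoregressive_loss(logits, targets, supervised_mask=None):
    """Mean next-token cross-entropy over an interleaved vision+text sequence.

    logits:          (T, V) next-token logits over a unified vocabulary that
                     (hypothetically) contains text ids and visual codebook ids.
    targets:         (T,) ground-truth next tokens, mixing both modalities.
    supervised_mask: optional (T,) bool; True = position contributes to the loss.
                     Under "vision-as-target", visual positions stay True
                     instead of being excluded from supervision.
    """
    # Log-softmax computed in a numerically stable way.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    if supervised_mask is not None:
        nll = nll[supervised_mask]
    return nll.mean()

# Toy interleaved sequence: text tokens (ids < 100), visual tokens (ids >= 100).
rng = np.random.default_rng(0)
vocab = 160  # illustrative: 100 text ids + 60 visual codebook ids
targets = np.array([5, 101, 140, 7, 2, 133])
logits = rng.normal(size=(len(targets), vocab))

# "vision-as-input": only text positions are supervised.
text_only_loss = unified_autoregressive_loss(logits, targets, targets < 100)
# "vision-as-target" (VLUAS-style): every position is supervised.
unified_loss = unified_autoregressive_loss(logits, targets)
print(text_only_loss, unified_loss)
```

The only structural difference between the two objectives is which positions the mask keeps; the unified variant backpropagates through visual-token predictions as well, which is the bias the paper argues against removing.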

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 48 | 2643 |
| Semantic Segmentation | ADE20K | mIoU | 54.2 | 1024 |
| Semantic Segmentation | Cityscapes | mIoU | 70.4 | 658 |
| Object Detection | COCO (val) | mAP | 47.1 | 633 |
| Semantic Segmentation | COCO Stuff | mIoU | 52.2 | 379 |
| Visual Grounding | RefCOCO+ (val) | Accuracy | 90.1 | 212 |
| Image Classification | ImageNet-ReaL | Precision@1 | 89.3 | 211 |
| Depth Estimation | NYU Depth V2 | -- | -- | 209 |
| Semantic Segmentation | Pascal Context 59 | mIoU | 60.4 | 204 |
| Visual Grounding | RefCOCO (val) | Accuracy | 93.6 | 147 |

Showing 10 of 31 rows

Other info

GitHub
