
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision

About

Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to vision-centric tasks, enabling a standard VLM to perform them without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
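The core idea of the abstract can be sketched numerically: a conventional VLM computes its autoregressive loss only over text positions, whereas a "vision-as-target" objective supervises visual-token positions as well. The toy sketch below, which assumes discrete visual tokens sharing one joint vocabulary with text, contrasts the two losses; the token ids, sequence layout, and random logits are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16  # joint vocabulary: ids 0-7 stand in for "visual" tokens, 8-15 for "text"

# An interleaved sequence: image tokens first, then caption tokens.
targets = np.array([3, 5, 1, 7, 9, 12, 8, 15])       # next-token targets
is_text = np.array([0, 0, 0, 0, 1, 1, 1, 1], bool)   # modality of each target position
logits = rng.normal(size=(len(targets), VOCAB))      # toy model outputs

def cross_entropy(logits, targets, mask):
    """Mean next-token negative log-likelihood over positions where mask is True."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return float(nll[mask].mean())

# Conventional "vision-as-input" objective: visual tokens condition the model
# but receive no loss; only text positions are supervised.
loss_text_only = cross_entropy(logits, targets, is_text)

# Unified "vision-as-target" objective: every position, visual or textual,
# contributes to the same autoregressive loss.
loss_unified = cross_entropy(logits, targets, np.ones_like(is_text))

print(f"text-only loss: {loss_text_only:.3f}")
print(f"unified loss:   {loss_unified:.3f}")
```

With uniform (all-zero) logits both losses reduce to log(VOCAB), which makes the cross-entropy easy to sanity-check; the difference in practice is simply which positions the gradient flows through.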

Zhixiang Wei, Yi Li, Zhehan Kan, Xinghua Jiang, Zuwei Long, Shifeng Liu, Hongze Shen, Wei Liu, Xiaoyu Tan, Haojia Lin, Yubo Zhu, Qianyu Li, Di Yin, Haoyu Cao, Weibo Gu, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Mingkong Tang, Shuangyin Liu, Lexiang Tang, Haodong Lin, Junru Lu, Jiarui Qin, Lingfeng Qiao, Ruizhi Qiao, Bo Ke, Jianfeng He, Ke Li, Yangning Li, Yunhang Shen, Mengdan Zhang, Peixian Chen, Kun Yin, Bing Liu, Yunfei Wu, Huang Chen, Zhongpeng Cai, Xiaotian Li • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Object Detection | COCO 2017 (val) | AP | 48 | 2454 |
| Semantic Segmentation | ADE20K | mIoU | 54.2 | 936 |
| Object Detection | COCO (val) | mAP | 47.1 | 613 |
| Semantic Segmentation | Cityscapes | mIoU | 70.4 | 578 |
| Semantic Segmentation | COCO Stuff | mIoU | 52.2 | 195 |
| Image Classification | ImageNet-ReaL | Precision@1 | 89.3 | 195 |
| Depth Estimation | NYU Depth V2 | -- | -- | 177 |
| Visual Grounding | RefCOCO+ (val) | Accuracy | 90.1 | 171 |
| Semantic Segmentation | Pascal Context 59 | mIoU | 60.4 | 164 |
| Visual Grounding | RefCOCO (testB) | Accuracy | 90.8 | 125 |

Showing 10 of 31 rows.

Other info

GitHub
