Revisiting Multi-Task Visual Representation Learning
About
Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To obviate the need for manual annotations, we leverage high-capacity "expert" models -- such as Depth Anything V2 and OWLv2 -- to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the mechanics of multi-task visual learning, analyzing: (i) the marginal gain of each objective, (ii) task synergies versus interference, and (iii) scaling behavior across varying data and model scales. Our results demonstrate that MTV achieves "best-of-both-worlds" performance, significantly enhancing fine-grained spatial reasoning without compromising global semantic understanding. Our findings suggest that multi-task learning, fueled by high-quality pseudo-supervision, is a scalable path toward more general visual encoders.
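The joint optimization described above can be sketched as a weighted sum of per-task losses over a shared backbone. The following is a minimal illustrative sketch, not the paper's implementation; the task names, loss values, and weights are hypothetical placeholders.

```python
def mtv_total_loss(losses, weights):
    """Combine per-task losses into a single scalar objective
    for a shared backbone (weighted-sum multi-task formulation)."""
    assert losses.keys() == weights.keys()
    return sum(weights[k] * losses[k] for k in losses)

# Hypothetical per-task losses from one forward pass of the shared backbone:
losses = {
    "contrastive": 0.8,       # vision-language alignment (CLIP-style)
    "self_supervised": 1.2,   # e.g., masked-image reconstruction (MAE-style)
    "dense_spatial": 0.5,     # dense supervision from expert pseudo-labels
}
# Illustrative task weights (the actual weighting scheme is not specified here):
weights = {"contrastive": 1.0, "self_supervised": 0.5, "dense_spatial": 0.5}

total = mtv_total_loss(losses, weights)  # 0.8 + 0.6 + 0.25 = 1.65
```

In practice the weights govern the synergy-versus-interference trade-off the paper analyzes; a fixed weighted sum is the simplest baseline choice.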
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K | mIoU | 48.2 | 936 |
| Text-to-Image Retrieval | COCO | Recall@1 | 43 | 130 |
| Image-to-Text Retrieval | COCO | Recall@1 | 58.1 | 123 |
| Monocular Depth Estimation | NYU V2 | -- | -- | 113 |
| Geometric Correspondence | NAVI | Avg. Recall | 0.45 | 8 |
| Semantic Correspondence | SPair | Avg. Recall | 30.1 | 8 |
| Relative Depth Estimation | KITTI | AbsRel | 0.082 | 8 |
| Relative Depth Estimation | NYU V2 | AbsRel | 5.2 | 8 |