Revisiting Multi-Task Visual Representation Learning
About
Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To obviate the need for manual annotations, we leverage high-capacity "expert" models -- such as Depth Anything V2 and OWLv2 -- to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the mechanics of multi-task visual learning, analyzing: (i) the marginal gain of each objective, (ii) task synergies versus interference, and (iii) scaling behavior across varying data and model scales. Our results demonstrate that MTV achieves "best-of-both-worlds" performance, significantly enhancing fine-grained spatial reasoning without compromising global semantic understanding. Our findings suggest that multi-task learning, fueled by high-quality pseudo-supervision, is a scalable path toward more general visual encoders.
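The joint optimization described above can be sketched as a weighted sum of per-task losses over a shared backbone. The following is a minimal illustrative sketch, not the paper's implementation; the task names, loss values, and weights are hypothetical placeholders.

```python
def mtv_total_loss(losses, weights):
    """Combine per-task losses into a single scalar objective
    for a shared backbone (weighted-sum multi-task formulation)."""
    assert losses.keys() == weights.keys()
    return sum(weights[k] * losses[k] for k in losses)

# Hypothetical per-task losses from one forward pass of the shared backbone:
losses = {
    "contrastive": 0.8,       # vision-language alignment (CLIP-style)
    "self_supervised": 1.2,   # e.g., masked-image reconstruction (MAE-style)
    "dense_spatial": 0.5,     # dense supervision from expert pseudo-labels
}
# Illustrative task weights (the actual weighting scheme is not specified here):
weights = {"contrastive": 1.0, "self_supervised": 0.5, "dense_spatial": 0.5}

total = mtv_total_loss(losses, weights)  # 0.8 + 0.6 + 0.25 = 1.65
```

In practice the weights govern the synergy-versus-interference trade-off the paper analyzes; a fixed weighted sum is the simplest baseline choice.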
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K | mIoU | 48.2 | 936 |
| Text-to-Image Retrieval | COCO | Recall@1 | 43 | 130 |
| Image-to-Text Retrieval | COCO | Recall@1 | 58.1 | 123 |
| Monocular Depth Estimation | NYU V2 | -- | -- | 113 |
| Geometric Correspondence | NAVI | Avg. Recall | 0.45 | 8 |
| Semantic Correspondence | SPair | Avg. Recall | 30.1 | 8 |
| Relative Depth Estimation | KITTI | AbsRel | 0.082 | 8 |
| Relative Depth Estimation | NYU V2 | AbsRel | 5.2 | 8 |