Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
About
Current multimodal large language models (MLLMs) struggle with fine-grained or precise visual understanding, even though they provide comprehensive perception and reasoning across a spectrum of vision applications. Recent studies either rely on tool use or unify specific visual tasks into the autoregressive framework, often at the expense of overall multimodal performance. To address this issue and enhance MLLMs with visual tasks in a scalable fashion, we propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. TPO introduces learnable task tokens that establish connections between multiple task-specific heads and the MLLM. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Through multi-task co-training within TPO, we observe synergistic benefits that elevate individual task performance beyond what single-task training achieves. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models. Additionally, MLLM-TPO demonstrates robust zero-shot capabilities across various tasks, performing comparably to state-of-the-art supervised models. The code will be released at https://github.com/OpenGVLab/TPO
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Object Tracking | LaSOT (test) | -- | 444 |
| Visual Object Tracking | GOT-10k (test) | -- | 378 |
| Video Understanding | MVBench | -- | 247 |
| Video Understanding | VideoMME | -- | 192 |
| Moment Retrieval | Charades-STA (test) | R@0.5: 40.2 | 172 |
| Video Grounding | Charades-STA | R@1 IoU=0.5: 40.2 | 113 |
| Referring Video Segmentation | Ref-YouTube-VOS | J&F Score: 63.9 | 91 |
| Video Understanding | MLVU | M-AVG: 54.7 | 54 |
| Referring Video Segmentation | MeViS | J&F Score: 47 | 50 |
| Grounded Video Question Answering | NExT-GQA | mIoU: 27.7 | 28 |
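
For readers unfamiliar with the temporal grounding metrics in the table, the following is a minimal sketch of how R@1 at IoU=0.5 and mIoU are commonly computed from predicted and ground-truth time spans. It is not taken from the TPO codebase: the function names, the `Span` type, and the sample spans are hypothetical, and the choice of `>=` for the threshold comparison is an assumption, since benchmark implementations differ on whether the boundary case counts as a hit.

```python
from typing import List, Tuple

# Hypothetical representation of a video moment: (start_sec, end_sec).
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Span], gts: List[Span], threshold: float = 0.5) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumption: a prediction exactly at the threshold counts as a hit.
    """
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds: List[Span], gts: List[Span]) -> float:
    """Average temporal IoU over all queries, as a percentage."""
    total = sum(temporal_iou(p, g) for p, g in zip(preds, gts))
    return 100.0 * total / len(gts)


if __name__ == "__main__":
    # Made-up spans purely for illustration; not benchmark data.
    preds = [(2.0, 8.0), (0.0, 4.0), (10.0, 12.0)]
    gts = [(3.0, 9.0), (2.0, 10.0), (20.0, 25.0)]
    print(f"R@1 IoU=0.5: {recall_at_1(preds, gts):.1f}")
    print(f"mIoU: {mean_iou(preds, gts):.1f}")
```

Only the first of the three illustrative predictions overlaps its ground truth by at least half, so one of three queries counts toward recall, while mIoU also gives partial credit to the second prediction.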