Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
About
Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs. right location. In this work, we explore how image-space coordinate-based instruction fine-tuning objectives can inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
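To make the idea of coordinate-based instruction fine-tuning concrete, here is a minimal sketch of how detection boxes might be converted into textual localization (question, answer) pairs. The prompt template, the bracketed coordinate format, and the binning scheme are illustrative assumptions, not the exact representation used in the paper.

```python
# Hedged sketch: turning pixel-space boxes into coordinate-based
# instruction-tuning samples. Exact format is an assumption.

def normalize_box(box, width, height, bins=100):
    """Map a pixel-space box (x1, y1, x2, y2) to integer bins in [0, bins-1]."""
    x1, y1, x2, y2 = box
    return (
        round(x1 / width * (bins - 1)),
        round(y1 / height * (bins - 1)),
        round(x2 / width * (bins - 1)),
        round(y2 / height * (bins - 1)),
    )

def make_localization_sample(label, box, width, height):
    """Build a (question, answer) pair asking the model to localize `label`."""
    x1, y1, x2, y2 = normalize_box(box, width, height)
    question = f"Where is the {label} in the image?"
    answer = f"The {label} is at [{x1}, {y1}, {x2}, {y2}]."
    return question, answer

q, a = make_localization_sample("dog", (50, 120, 300, 360), 640, 480)
print(q)
print(a)
```

Normalizing into a fixed number of bins keeps the coordinate vocabulary small and resolution-independent, which is one common way such spatial tokens are represented in instruction data.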
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | MSRVTT-QA | -- | -- | 481 |
| Video Question Answering | MSVD-QA | -- | -- | 340 |
| Video Question Answering | ActivityNet-QA | -- | -- | 319 |
| Video Question Answering | ActivityNet-QA (test) | Accuracy | 37.4 | 275 |
| Object Hallucination | POPE (Random) | F1 Score | 88.5 | 200 |
| Object Hallucination | POPE Adversarial | Accuracy | 78.8 | 196 |
| Object Hallucination | POPE Popular | F1 Score | 86.3 | 188 |
| Visual Question Answering | GQA (test-dev) | Accuracy | 63.5 | 178 |
| Video Question Answering | TGIF-QA | Accuracy | 51.8 | 147 |
| Visual Question Answering | VQA v2 (test) | Accuracy | 78.6 | 131 |