
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

About

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance on vision-language tasks, particularly visual question answering (VQA). However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs. right location. In this work, we explore how image-space coordinate-based instruction fine-tuning objectives can inject spatial awareness into V-LLMs. We identify optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
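The abstract mentions coordinate-based instruction fine-tuning and pseudo-data generation but does not specify the textual coordinate format. As a minimal sketch, assuming normalized [0, 1] coordinates rounded to two decimals (the paper's actual representation may differ), such a localization instruction sample could be generated like this:

```python
# Hypothetical sketch of coordinate-based instruction pseudo-data generation.
# The normalized two-decimal format and the prompt wording are assumptions,
# not the paper's confirmed representation.

def normalize_box(box, width, height):
    """Convert a pixel-space box (x1, y1, x2, y2) to normalized [0, 1] coords."""
    x1, y1, x2, y2 = box
    return (round(x1 / width, 2), round(y1 / height, 2),
            round(x2 / width, 2), round(y2 / height, 2))

def make_localization_sample(label, box, width, height):
    """Build one instruction/answer pair asking the model to localize an object."""
    nx1, ny1, nx2, ny2 = normalize_box(box, width, height)
    instruction = f"Provide the bounding box of the {label} in the image."
    answer = f"[{nx1}, {ny1}, {nx2}, {ny2}]"
    return {"instruction": instruction, "answer": answer}

sample = make_localization_sample("dog", (64, 128, 320, 448), 640, 640)
print(sample["answer"])  # → [0.1, 0.2, 0.5, 0.7]
```

Normalizing by image size keeps the coordinate vocabulary resolution-independent, which is one common design choice for expressing locations as text tokens.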

Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Question Answering | MSRVTT-QA | – | 481 |
| Video Question Answering | MSVD-QA | – | 340 |
| Video Question Answering | ActivityNet-QA | – | 319 |
| Video Question Answering | ActivityNet-QA (test) | Accuracy: 37.4 | 275 |
| Object Hallucination | POPE (Random) | F1 Score: 88.5 | 200 |
| Object Hallucination | POPE (Adversarial) | Accuracy: 78.8 | 196 |
| Object Hallucination | POPE (Popular) | F1 Score: 86.3 | 188 |
| Visual Question Answering | GQA (test-dev) | Accuracy: 63.5 | 178 |
| Video Question Answering | TGIF-QA | Accuracy: 51.8 | 147 |
| Visual Question Answering | VQA v2 (test) | Accuracy: 78.6 | 131 |
Showing 10 of 20 rows
