
INST-IT: Boosting Instance Understanding via Explicit Visual Prompt Instruction Tuning

About

Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires more fine-grained comprehension and alignment. Instance-level understanding is crucial for LMMs because it focuses on the specific elements we are most interested in. Encouragingly, existing work finds that state-of-the-art (SOTA) LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we propose Inst-IT, a solution that enhances LMMs in Instance understanding via explicit visual prompt Instruction Tuning for instance guidance. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm that enhances the spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, enhanced by Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench and other instance understanding benchmarks, but also demonstrate significant improvements across various generic image and video understanding benchmarks. This indicates that our method boosts instance-level understanding and also strengthens generic image and video comprehension overall.

Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang • 2024

Related benchmarks

Task                                       Dataset                       Metric          Result   Rank
Multi-discipline Multimodal Understanding  MMMU (val)                    Accuracy        42.7     167
Chart Question Answering                   ChartQA (test)                Accuracy        72.8     129
Diagram Question Answering                 AI2D (test)                   Accuracy        78.7     103
Video Question Answering                   NExT-QA Multi-choice          Accuracy        73       102
Video Question Answering                   EgoSchema subset              Accuracy        57.8     73
Video Understanding                        Video-MME without subtitles   Overall Score   54       67
Temporal Video Understanding               TempCompass                   Average Score   63.9     52
Object Hallucination Evaluation            POPE (test)                   --              --       44
Visual Question Answering                  GQA (val)                     Accuracy        65.9     22
Visual Question Answering                  INST-IT Bench Image           MC Accuracy     75.3     16
Showing 10 of 14 rows
