VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

About

While large multi-modal models (LMMs) have exhibited impressive capabilities across diverse tasks, their effectiveness on complex tasks has been limited by the prevailing single-step reasoning paradigm. To address this, this paper proposes VoCoT, a multi-step, Visually grounded, object-centric Chain-of-Thought reasoning framework tailored for inference with LMMs. VoCoT is characterized by two key features: (1) object-centric reasoning paths that revolve around cross-modal shared object-level information, and (2) visually grounded representation of object concepts in a multi-modal interleaved and aligned manner, which effectively bridges the modality gap within LMMs during long-term generation. To adapt LMMs to reasoning with VoCoT, we further construct an instruction-tuning dataset. By combining VoCoT with prevalent open-source LMM architectures, we develop a VoCoT-based model, VolCano. With only 7B parameters and limited input image resolution, VolCano demonstrates excellent performance across various scenarios. On benchmarks such as CLEVR and EmbSpatial, which demand complex reasoning capabilities, VolCano outperforms SOTA models, including GPT-4V. Related code, data, and models are released at https://github.com/RupertLuo/VoCoT.

Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Xuanjing Huang, Zhongyu Wei • 2024

Related benchmarks

Task                        Dataset      Result           Rank
Image Classification        WHU-RS19     Accuracy 76.84   70
Image Classification        AID          Accuracy 46.67   66
Visual Question Answering   RSVQA-HR     --               29
Vision-Language Reasoning   VRSBench     Accuracy 44.63   10
Object Perception           DOTA (val)   Accuracy 20.97   10
Object Perception           HRRSD        Accuracy 41.72   10
Object Perception           VHR          Accuracy 41      10
Object Perception           VisDrone     Accuracy 6.5     10