VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

About

While large multi-modal models (LMMs) have exhibited impressive capabilities across diverse tasks, their effectiveness on complex tasks has been limited by the prevailing single-step reasoning paradigm. To address this, this paper proposes VoCoT, a multi-step Visually grounded object-centric Chain-of-Thought reasoning framework tailored for inference with LMMs. VoCoT is characterized by two key features: (1) object-centric reasoning paths that revolve around cross-modal shared object-level information, and (2) visually grounded representation of object concepts in a multi-modal interleaved and aligned manner, which effectively bridges the modality gap within LMMs during long-term generation. To adapt LMMs to reasoning with VoCoT, we further construct an instruction-tuning dataset. By combining VoCoT with prevalent open-source LMM architectures, we develop a VoCoT-based model, VolCano. With only 7B parameters and limited input image resolution, VolCano demonstrates excellent performance across various scenarios. On benchmarks such as CLEVR and EmbSpatial, which demand complex reasoning capabilities, VolCano outperforms SOTA models, including the powerful GPT-4V. Related code, data, and models are released at https://github.com/RupertLuo/VoCoT.

Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Xuanjing Huang, Zhongyu Wei • 2024

Related benchmarks

Task	Dataset	Result
Image Classification	WHU-RS19	Accuracy 76.84	104
Image Classification	AID	Accuracy 46.67	83
Visual Question Answering	RSVQA-HR	--	38
Vision-Language Reasoning	VRSBench	Accuracy 44.63	10
Object Perception	DOTA (val)	Accuracy 20.97	10
Object Perception	HRRSD	Accuracy 41.72	10
Object Perception	VHR	Accuracy 41	10
Object Perception	VisDrone	Accuracy 6.5	10
Multimodal Spatial Reasoning	VSR	Accuracy 68.88	8
Multimodal Spatial Reasoning	V-Star	Accuracy 59.87	8

Showing 10 of 12 rows
