
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

About

Large language models have demonstrated substantial advancements in reasoning capabilities. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a large VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements on reasoning-intensive tasks. To accomplish this, we construct the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. In addition, we propose a test-time stage-wise retracing search method (SWIRES), which enables effective and efficient test-time scaling. Remarkably, with only 100k training samples and test-time scaling, LLaVA-CoT not only outperforms its base model by 9.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The code, dataset, and pre-trained weights are publicly available at https://github.com/PKU-YuanGroup/LLaVA-CoT.

Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, Li Yuan • 2024
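
The abstract names four sequential stages and a stage-wise retracing search, but it does not say how candidates are scored or when a stage is retraced. The following Python sketch is one possible reading, not the authors' implementation. The `generate` and `score` callables, the `threshold`, the candidate count, and the retrace budget are all hypothetical placeholders chosen for illustration.

```python
from typing import Callable, List

# Stage names follow the abstract's wording; the identifiers are illustrative.
STAGES = ["summarization", "visual_interpretation", "logical_reasoning", "conclusion"]


def stagewise_retracing_search(
    generate: Callable[[str, List[str]], str],  # hypothetical: (stage, prior outputs) -> candidate text
    score: Callable[[str, str], float],         # hypothetical: (stage, candidate) -> quality score
    num_candidates: int = 4,
    threshold: float = 0.5,                     # assumed cutoff below which we retrace
    max_retraces: int = 2,                      # assumed budget so the search always terminates
) -> List[str]:
    """Pick the best candidate per stage; if even the best is weak, redo the previous stage."""
    chosen: List[str] = []
    retraces = 0
    i = 0
    while i < len(STAGES):
        stage = STAGES[i]
        # Sample several candidates for this stage, conditioned on earlier stages.
        candidates = [generate(stage, chosen) for _ in range(num_candidates)]
        best = max(candidates, key=lambda c: score(stage, c))
        if score(stage, best) < threshold and i > 0 and retraces < max_retraces:
            # Retrace: discard the previous stage's output and regenerate it.
            chosen.pop()
            i -= 1
            retraces += 1
            continue
        chosen.append(best)
        i += 1
    return chosen
```

In practice `generate` would wrap a call to the VLM and `score` would wrap whatever judge or reward signal is available. Both are left abstract here because the abstract does not specify them.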

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | MathVista | Score | 54.8 | 322 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 60.3 | 281 |
| Multimodal Reasoning | MMStar | -- | -- | 81 |
| Multi-discipline Multimodal Understanding | MMMU-Pro | -- | -- | 56 |
| Chart Understanding | ChartQA (test) | Accuracy | 67 | 52 |
| Multimodal Reasoning | MMBench | -- | -- | 50 |
| Mathematical Reasoning | MathVerse | -- | -- | 39 |
| Document Visual Question Answering | InfoVQA | ANLS | 44.8 | 32 |
| Multidisciplinary Knowledge | MMMU | Score | 48.9 | 21 |
| Multidisciplinary Knowledge | MMBench | Score | 75 | 20 |
Showing 10 of 19 rows
