LATTE: Learning to Think with Vision Specialists

About

While open-source vision-language models perform well on simple question answering, they still struggle with complex questions that require both perceptual and reasoning capabilities. We propose LATTE, a family of vision-language models that have LeArned to Think wiTh vision spEcialists. By offloading perception to state-of-the-art vision models, our approach lets vision-language models focus solely on reasoning over high-quality perceptual information. To train LATTE, we synthesize and filter a large dataset of 293K multi-modal reasoning traces over the perceptual outputs of vision specialists. Trained on this data, LATTE achieves significant gains of 4-5% over baselines across six benchmarks covering both perception and reasoning abilities. Ablation studies reveal that the effectiveness of multi-modal reasoning traces depends on their data sources, formats, and the quality of the thoughts.
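To make the idea concrete, the sketch below shows one way an inference loop of this kind could look: the language model alternates between reasoning steps and calls to vision specialists, whose textual outputs are fed back as perceptual context. This is an illustrative assumption only; the names (`latte_style_inference`, `SPECIALISTS`, the `CALL`/`ANSWER` protocol) are hypothetical and not the released LATTE API.

```python
# Hypothetical sketch of a "think with vision specialists" loop.
# Assumes a chat-style VLM exposed as `vlm_generate(prompt) -> str` that
# emits either "CALL <tool>" to request perception or "ANSWER <text>".
from typing import Callable

# Stand-in vision specialists: each maps an image to text the VLM can read.
SPECIALISTS: dict[str, Callable[[bytes], str]] = {
    "detect_objects": lambda img: "person (0.98) at [12, 40, 210, 380]",
    "read_text": lambda img: "STOP",  # stand-in for a real OCR model
}

def latte_style_inference(vlm_generate, image: bytes, question: str,
                          max_steps: int = 4) -> str:
    """Alternate between the VLM's reasoning steps and specialist calls."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = vlm_generate(transcript)
        transcript += step + "\n"
        if step.startswith("ANSWER"):
            return step.removeprefix("ANSWER").strip()
        if step.startswith("CALL"):
            tool = step.removeprefix("CALL").strip()
            # Perception is offloaded: the specialist's output becomes
            # high-quality perceptual context for the next reasoning step.
            observation = SPECIALISTS.get(tool, lambda _: "unknown tool")(image)
            transcript += f"Observation: {observation}\n"
    return "no answer within budget"

# Example with a scripted stand-in for the VLM:
script = iter(["CALL read_text", "ANSWER The sign says STOP."])
print(latte_style_inference(lambda p: next(script), b"", "What does the sign say?"))
```

One design point this sketch reflects: specialists return text rather than raw tensors, so all reasoning stays inside the language model, which is what allows the training traces to interleave thoughts with perceptual observations.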

Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, Ranjay Krishna, Silvio Savarese • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning in Vision | MathVista | MathVista Accuracy: 38.9 | 48 |
| Multi-modal Understanding | MMVet | Accuracy: 50 | 35 |
| Multi-modal Understanding | CV-Bench, BLINK, RealWorldQA, MathVista, MMStar, MMVet | Average Score: 53.8 | 8 |
| Visual Question Answering (Perception + Reasoning) | MathVista, MMStar, MMVet | MathVista Score: 46.9 | 8 |
| Visual Question Answering (Perception) | CV-Bench, BLINK, RealWorldQA | CV-Bench Score: 60.2 | 8 |

Other info

Code
