ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

About

Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary rather than isomorphic modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on approximately 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7 percent over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.

Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, Yu Cheng • 2025

Related benchmarks

Task	Dataset	Metric	Result
Multimodal Understanding	MMStar	Accuracy	56.9	511
Chart Understanding	ChartQA	Accuracy	78.1	159
Multimodal Visual Perception	MMVP	Accuracy	78.33	106
Visual Search	V*	Accuracy	67	53
Vision Understanding	MMVP	Accuracy	80.33	45
Multi-view spatial reasoning	MindCube	Accuracy	39.2	37
Visual Search	HR-Bench-4K	Accuracy	54	37
Vision-centric Evaluation	CV-Bench	Accuracy	0.8086	34
Multimodal Reasoning	BLINK	Accuracy	59.49	33
Visual Search	HR-Bench-8K	Accuracy	46.7	29
