ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking

About

Chain-of-thought (CoT) reasoning has significantly enhanced the reasoning ability of LLMs, but it faces challenges when extended to multimodal domains, particularly in mathematical tasks. Existing MLLMs typically perform textual reasoning from a single static mathematical image, overlooking dynamic visual acquisition during reasoning. In contrast, humans repeatedly examine the image and employ step-by-step reasoning to prove intermediate propositions. This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller's Law in cognitive science. Inspired by this insight, we propose the ViRC framework for multimodal mathematical tasks, introducing a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to simulate human expert problem-solving patterns. CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning. To this end, we present the CRUX dataset, built using three visual tools and four reasoning patterns, which provides explicitly annotated CRUs across multiple reasoning paths for each mathematical problem. Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, which includes Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the Reason Chunking ability of the model. The resulting ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks. Code is available at https://github.com/Leon-LihongWang/ViRC.
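As a rough illustration of the Reason Chunking idea described in the abstract, the sketch below models a reasoning chain as a list of units, each with an optional visual look followed by text steps that establish one intermediate proposition. It is not taken from the ViRC codebase: the class names `VisualAction` and `CRU`, the function `render_chain`, the tool name `crop`, and the toy geometry example are all hypothetical placeholders chosen for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VisualAction:
    """Hypothetical record of one visual tool call (tool names are placeholders)."""
    tool: str
    argument: str


@dataclass
class CRU:
    """One Critical Reasoning Unit (assumed layout, not the CRUX schema):
    an optional visual look, then text steps that establish one proposition."""
    proposition: str
    steps: List[str]
    visual_action: Optional[VisualAction] = None


def render_chain(units: List[CRU]) -> str:
    """Flatten a list of CRUs into an interleaved visual/text trace."""
    lines = []
    for i, unit in enumerate(units, start=1):
        # Visual information is consulted at the boundary between units.
        if unit.visual_action is not None:
            action = unit.visual_action
            lines.append(f"[CRU {i}] look: {action.tool}({action.argument})")
        # Within a unit, reasoning stays purely textual.
        for step in unit.steps:
            lines.append(f"[CRU {i}] step: {step}")
        lines.append(f"[CRU {i}] proposition: {unit.proposition}")
    return "\n".join(lines)


# Toy example: two units for an isosceles-triangle question.
chain = [
    CRU(
        proposition="angle B = angle C",
        steps=["AB = AC is marked in the figure", "so triangle ABC is isosceles"],
        visual_action=VisualAction(tool="crop", argument="triangle ABC"),
    ),
    CRU(
        proposition="angle B = 70 degrees",
        steps=["angle A = 40 degrees, so angle B = (180 - 40) / 2"],
    ),
]

print(render_chain(chain))
```

Keeping each proposition attached to the steps that justify it mirrors the intra-unit coherence the abstract describes. The actual CRUX annotation format may differ.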

Lihong Wang, Liangqi Li, Weiwei Feng, Jiamin Wu, Changtao Miao, Tieru Wu, Rui Ma, Bo Zhang, Zhe Li • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GeoQA (test)	Accuracy 75.07	31
Mathematical Reasoning	MMStar Math	Accuracy 77.2	26
Mathematical Reasoning	MathVista Math	ALL Accuracy 81.11	19
Visual Reasoning	HR-Bench (test)	Accuracy 69.94	15
Visual Reasoning	VisualProbe (VP) cross-domain (test)	Accuracy 0.4357	15
Visual Reasoning	V* cross-domain (test)	Accuracy 79.06	15

