Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

About

Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.

Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, Ziwei Liu• 2026

Related benchmarks

Task	Dataset	Result
Chart Understanding and Reasoning	ChartQA	Accuracy86.1	87
Multi-modal Video Understanding	VideoMME	Accuracy67.8	64
Visual Reasoning	MMStar	Accuracy68.2	51
Visual Reasoning	MMBench	--	48
Visual Perception	AI2D	Accuracy81.7	47
Image Understanding	TextVQA	Accuracy80.6	43
Visual Reasoning	MMMU (val)	Accuracy64.8	22
Visual Perception	OCRBench	Score830	22
Visual Reasoning	MathVista mini (test)	Accuracy77.6	21
Visual Perception	MME	Perception Score2.41e+3	20

Showing 10 of 23 rows

Other info

Follow for update

@wizwand_team Discord