Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

About

Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding, yet still struggle with inaccurate evidence localization. To address these limitations, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies context and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we 1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that include frame identification, evidence reasoning, and action decision, and 2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to progressively incentivize multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long video understanding tasks, validating its strong scalability and robustness.

Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun • 2025
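
The identify, reason, then act loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Step` record, the `uniform_sample` helper, the label strings, the round limit, and the `toy_policy` stand-in for the model are all hypothetical names chosen for illustration. The sketch only shows the control flow in which each round labels the current frames, reasons over them, and either concludes or requests further frames.

```python
from typing import Callable, Dict, List, NamedTuple


class Step(NamedTuple):
    """One round of the loop (hypothetical structure, not from the paper)."""
    labels: Dict[int, str]   # frame index -> "evidence" or "irrelevant" (assumed labels)
    reasoning: str           # free-text reasoning over the labelled frames
    action: str              # "explore" or "conclude"
    next_frames: List[int]   # frames to inspect next when exploring


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Pick k roughly evenly spaced frame indices from [0, num_frames)."""
    k = min(k, num_frames)
    return [i * num_frames // k for i in range(k)]


def run_loop(
    question: str,
    num_frames: int,
    policy: Callable[[str, List[int]], Step],
    k: int = 4,
    max_rounds: int = 3,
) -> List[Step]:
    """Call the policy until it concludes or the round budget is spent."""
    frames = uniform_sample(num_frames, k)
    trace: List[Step] = []
    for _ in range(max_rounds):
        step = policy(question, frames)
        trace.append(step)
        if step.action == "conclude" or not step.next_frames:
            break
        frames = step.next_frames
    return trace


def toy_policy(question: str, frames: List[int], target: int = 42) -> Step:
    """Stand-in for the model: pretends the clue sits near frame `target`."""
    labels = {
        f: ("evidence" if abs(f - target) <= 2 else "irrelevant") for f in frames
    }
    if "evidence" in labels.values():
        return Step(labels, "clue found in sampled frames", "conclude", [])
    # No evidence yet: sample more densely around the closest frame seen so far.
    closest = min(frames, key=lambda f: abs(f - target))
    return Step(
        labels,
        "no clue yet, look nearby",
        "explore",
        [closest + d for d in (-10, -5, 0, 5, 10)],
    )


trace = run_loop("Who took the key?", num_frames=100, policy=toy_policy)
print([step.action for step in trace])
```

In a real system the policy would be the trained model and frame selection would be learned rather than hard-coded; the toy policy is only there to make the loop runnable.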

Related benchmarks

Task	Dataset	Result
Video Question Answering	VideoMME	--	251
Video Question Answering	LongVideoBench	Accuracy 56.6	210
Video Understanding	LongVideoBench	--	123
Video Understanding	MLVU	Accuracy 59.2	114
Video Understanding	MMVU	Accuracy 64	76
Video Understanding	LVBench	--	75
Temporal Grounding	Charades-STA (test)	--	68
Video Question Answering	MLVU	M-Avg Score 63.4	40
Video Question Answering	LVBench	Overall Score 39.2	38
Video Understanding	Video-MME w/o sub	Accuracy 55.5	33

