
Multimodal Latent Reasoning via Hierarchical Visual Cues Injection

About

The advancement of multimodal large language models (MLLMs) has enabled impressive perception capabilities. However, their reasoning process often follows a "fast thinking" paradigm, relying on end-to-end generation or explicit, language-centric chains of thought (CoT), which can be inefficient, verbose, and prone to hallucination. This work posits that robust reasoning should evolve within a latent space, integrating multimodal signals seamlessly. We propose multimodal latent reasoning via HIerarchical Visual cuEs injection (HIVE), a novel framework that instills deliberate, "slow thinking" without depending on superficial textual rationales. Our method recursively extends transformer blocks, creating an internal loop for iterative reasoning refinement. Crucially, it grounds this process by injecting hierarchical visual cues, ranging from global scene context to fine-grained regional details, directly into the model's latent representations. This enables the model to perform grounded, multi-step inference entirely in the aligned latent space. Extensive evaluations demonstrate that test-time scaling is effective when incorporating vision knowledge, and that integrating hierarchical information significantly enhances the model's understanding of complex scenes.
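The abstract describes two mechanisms: a recursive loop that reuses shared transformer blocks for iterative latent refinement, and the injection of hierarchical visual cues (global plus regional) into the latent state at each step. Since the paper's code is not shown here, the sketch below is a minimal, hypothetical NumPy illustration of that control flow only; all function names, the additive injection scheme, and the toy "block" (a residual nonlinearity standing in for a transformer block) are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy latent dimension

def block(h, W):
    # Stand-in for one shared transformer block: residual + nonlinearity.
    # Recursion reuses the same weights W at every step.
    return h + np.tanh(h @ W)

def inject(h, global_cue, regional_cues, alpha=0.5):
    # Hypothetical additive grounding: fold the global scene cue and the
    # pooled fine-grained regional cues into the latent state.
    return h + alpha * (global_cue + regional_cues.mean(axis=0))

def latent_reasoning(h0, global_cue, regional_cues, W, steps=4):
    # Internal loop: re-inject visual cues, then refine, for `steps` iterations
    # (more steps = more test-time "slow thinking" in latent space).
    h = h0
    for _ in range(steps):
        h = inject(h, global_cue, regional_cues)
        h = block(h, W)
    return h

h0 = rng.normal(size=d)                # initial latent state
g = rng.normal(size=d)                 # global scene cue
r = rng.normal(size=(3, d))            # fine-grained regional cues
W = 0.1 * rng.normal(size=(d, d))      # shared block weights

h_final = latent_reasoning(h0, g, r, W, steps=4)
```

The key design point mirrored here is that refinement happens entirely on latent vectors, with no intermediate text generated, and that the visual grounding is re-applied on every iteration rather than only at the start.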

Yiming Zhang, Qiangyu Yan, Borui Jiang, Kai Han • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | -- | 1455 |
| Text-based Visual Question Answering | TextVQA (val) | -- | 262 |
| Visual Question Answering | SEED-Bench Image | Accuracy 70.5 | 64 |
| OCR VQA | ChartQA (test) | Accuracy 67 | 22 |
| General Visual Question Answering | RealWorldQA | Score 57.9 | 20 |
| Visual Question Answering | ScienceQA (image) | Score 91.6 | 17 |
| General Visual Question Answering | MMBench-EN (dev) | Overall Score 69.6 | 10 |
| OCR and Chart Visual Question Answering | DocVQA (val) | Score 73.2 | 7 |
