Self-Prophetic Decoding to Unlock Visual Search in LVLMs

About

Large Vision-Language Models (LVLMs) are rapidly evolving toward true multimodal reasoning, with visual search representing a concrete instantiation of the thinking-with-images paradigm. However, LVLM visual search faces two key challenges: incompatibility among intrinsic capabilities after post-training, and interference in long multi-step reasoning contexts. To address these, we identify two novel insights. First, self-regulation between pre- and post-training LVLMs leverages the intrinsic single-step capabilities of the pre-training model to mitigate capability deterioration and long-context interference. Second, probability-based prophetic sampling, replacing naive prompting, provides a probabilistic interface where the pre-training model acts as a prophet and the post-training model selectively accepts prophetic tokens under its output distribution, preserving coherent multi-step reasoning. Building on these insights, we introduce SeProD, a self-prophetic decoding framework that leverages intrinsic single-step capabilities to enable coherent multi-step reasoning in a training-free, plug-and-play manner. Experiments show that SeProD consistently improves multiple visual-search LVLMs across all 12 splits of 4 visual search benchmarks, as well as across general VQA benchmarks, without added computational overhead, thanks to its parallel prophetic acceptance mechanism.
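The abstract does not spell out the acceptance rule, so the following is only a minimal sketch of the general idea under stated assumptions: the pre-training model (the "prophet") proposes a short run of tokens, the post-training model scores all of them in one parallel pass, and the longest prefix whose tokens are sufficiently likely under the post-training distribution is kept. The function name `accept_prophetic_prefix`, the fixed `threshold`, and the dictionary-based distributions are all hypothetical illustrations, not SeProD's actual interface or criterion.

```python
from typing import Dict, List, Sequence


def accept_prophetic_prefix(
    prophetic_tokens: Sequence[str],
    post_probs: Sequence[Dict[str, float]],
    threshold: float = 0.1,
) -> List[str]:
    """Return the longest accepted prefix of the prophet's proposed tokens.

    prophetic_tokens: tokens proposed by the pre-training model (assumed input).
    post_probs: one next-token distribution per proposed position, assumed to
        come from a single parallel forward pass of the post-training model.
    threshold: hypothetical acceptance cutoff; the paper's rule may differ.
    """
    accepted: List[str] = []
    for token, dist in zip(prophetic_tokens, post_probs):
        # Stop at the first token the post-training model finds unlikely;
        # later proposals were conditioned on it and are discarded as well.
        if dist.get(token, 0.0) < threshold:
            break
        accepted.append(token)
    return accepted


# Toy example with made-up probabilities for a hypothetical crop action.
proposal = ["crop", "(", "120", ",", "40"]
scores = [
    {"crop": 0.7, "zoom": 0.3},
    {"(": 0.9},
    {"120": 0.05, "118": 0.6},  # post-training model disagrees here
    {",": 0.9},
    {"40": 0.8},
]
accepted = accept_prophetic_prefix(proposal, scores)
```

In this toy run only the first two proposed tokens survive, after which the post-training model would continue decoding on its own. Because all positions are scored together, the check adds no sequential decoding steps, which is in the spirit of the parallel acceptance the abstract describes.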

Zhendong He, Qiyuan Dai, Guanbin Li, Liang Lin, Sibei Yang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| Visual Question Answering | ScienceQA | Accuracy | 85.4 | 446 |
| Visual Search | V* Benchmark | Overall Success Rate | 91.1 | 54 |
| Visual Question Answering | OCRBench | Score | 85.3 | 53 |
| Fine-grained visual search | HR-Bench-8K | Overall Score | 75.2 | 24 |
| Visual Search | VisualProbe (test) | Success Rate (Easy) | 71.3 | 22 |
| High-resolution Visual Search | HR-Bench-4K | Overall Score | 78.6 | 11 |
| Visual Question Answering | CV-Bench | Accuracy | 78.4 | 7 |
| Visual Question Answering | MME RealWorld | Accuracy | 67.7 | 4 |