Self-Prophetic Decoding to Unlock Visual Search in LVLMs
About
Large Vision-Language Models (LVLMs) are rapidly evolving toward true multimodal reasoning, with visual search representing a concrete instantiation of the thinking-with-images paradigm. However, LVLM visual search faces two key challenges: incompatibility among intrinsic capabilities after post-training, and interference in long multi-step reasoning contexts. To address these, we identify two novel insights. First, self-regulation between pre- and post-training LVLMs leverages the intrinsic single-step capabilities of the pre-training model to mitigate capability deterioration and long-context interference. Second, probability-based prophetic sampling, replacing naive prompting, provides a probabilistic interface where the pre-training model acts as a prophet and the post-training model selectively accepts prophetic tokens under its output distribution, preserving coherent multi-step reasoning. Building on these insights, we introduce SeProD, a self-prophetic decoding framework that leverages intrinsic single-step capabilities to enable coherent multi-step reasoning in a training-free, plug-and-play manner. Experiments show that SeProD consistently improves multiple visual-search LVLMs across all 12 splits of 4 visual search benchmarks, as well as across general VQA benchmarks, without added computational overhead, thanks to its parallel prophetic acceptance mechanism.
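The abstract describes the acceptance step only at a high level, so the following is a minimal sketch of what a probability-based acceptance rule could look like, not SeProD's actual procedure. It assumes a speculative-sampling-style test: a token proposed by the prophet (the pre-training model) is accepted with probability min(1, q/p), where p is the prophet's probability for that token and q is the post-training model's probability for it. The function name `accept_prophetic_prefix`, its parameters, and the rule itself are illustrative assumptions. The step that resamples a replacement token after a rejection is omitted.

```python
import random
from typing import Callable, List, Sequence


def accept_prophetic_prefix(
    tokens: Sequence[int],
    prophet_probs: Sequence[float],
    target_probs: Sequence[float],
    uniform: Callable[[], float] = random.random,
) -> List[int]:
    """Return the longest accepted prefix of the prophet's proposed tokens.

    tokens[i]:        i-th token proposed by the prophet (pre-training model).
    prophet_probs[i]: probability the prophet assigned to tokens[i].
    target_probs[i]:  probability the post-training model assigns to tokens[i],
                      assumed here to come from a single parallel forward pass.
    uniform:          source of uniform samples in [0, 1), injectable for tests.

    Hypothetical rule (an assumption, not taken from the paper): accept each
    token with probability min(1, q / p) and stop at the first rejection.
    """
    if not (len(tokens) == len(prophet_probs) == len(target_probs)):
        raise ValueError("all inputs must have the same length")

    accepted: List[int] = []
    for tok, p, q in zip(tokens, prophet_probs, target_probs):
        if p <= 0.0:
            raise ValueError("prophet probability must be positive")
        # Tokens the post-training model rates at least as likely as the
        # prophet did are always kept; others are kept with probability q / p.
        if uniform() < min(1.0, q / p):
            accepted.append(tok)
        else:
            break  # the first rejection ends the accepted prefix
    return accepted
```

Under this assumed rule, every proposed token can be scored by the post-training model in one pass, which is one way a "parallel" acceptance step could avoid extra sequential decoding. Passing a fixed `uniform` such as `lambda: 0.5` makes the behavior deterministic for inspection.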
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | ScienceQA | Accuracy | 85.4 | 446 |
| Visual Search | V* Benchmark | Overall Success Rate | 91.1 | 54 |
| Visual Question Answering | OCRBench | Score | 85.3 | 53 |
| Fine-grained visual search | HR-Bench-8K | Overall Score | 75.2 | 24 |
| Visual Search | VisualProbe (test) | Success Rate (Easy) | 71.3 | 22 |
| High-resolution Visual Search | HR-Bench-4K | Overall Score | 78.6 | 11 |
| Visual Question Answering | CV-Bench | Accuracy | 78.4 | 7 |
| Visual Question Answering | MME RealWorld | Accuracy | 67.7 | 4 |