Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective

About

Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited by Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions, where uncertainty is usually epistemic: it arises from missing or ambiguous visual evidence rather than from multiple plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive sufficient conditions under which greedy decoding is optimal. Extensive experiments across multiple benchmarks provide empirical evidence that greedy decoding outperforms stochastic sampling. Furthermore, we propose Greedy Decoding for Reasoning Models, which outperforms both stochastic sampling and standard greedy decoding in multimodal reasoning scenarios. Overall, our results caution against naively inheriting LLM decoding heuristics in MLLMs and demonstrate that greedy decoding can be an efficient yet strong default for VQA.
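
The intuition linking calibration to greedy decoding can be illustrated with a small toy calculation. The sketch below is not the paper's formalization; it assumes a perfectly calibrated model (the probability assigned to each candidate answer equals the chance that answer is correct) and a single-step answer space. The logits and all function names are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_accuracy(logits):
    """Expected accuracy when always returning the top-probability answer."""
    # Assumption: the model is perfectly calibrated, so its temperature-1
    # probabilities are also the probabilities that each answer is correct.
    p_correct = softmax(logits)
    return max(p_correct)


def sampling_accuracy(logits, temperature=1.0):
    """Expected accuracy when sampling one answer at the given temperature."""
    p_correct = softmax(logits)               # chance each answer is right
    q_sample = softmax(logits, temperature)   # chance each answer is drawn
    return sum(q * p for q, p in zip(q_sample, p_correct))


# Hypothetical head-heavy answer distribution over four candidate answers.
toy_logits = [2.0, 0.5, 0.0, -1.0]

print(f"greedy: {greedy_accuracy(toy_logits):.3f}")
for t in (0.5, 1.0, 1.5):
    print(f"sampling T={t}: {sampling_accuracy(toy_logits, t):.3f}")
```

In this idealized setting the expected accuracy of sampling is a weighted average of the per-answer correctness probabilities, so it can never exceed the largest of them, which is exactly what greedy decoding attains. Lowering the temperature moves sampling toward the greedy result. Real models are not perfectly calibrated and real answers span multiple tokens, so this is only a rough picture of why a head-heavy, closed-ended task can favor greedy decoding.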

Boqi Chen, Xudong Liu, Yunke Ao, Jianing Qiu • 2026

Related benchmarks

Task	Dataset	Result
Visual Question Answering	ChartQA	Accuracy 81.82	620
Visual Perception	BLINK	Accuracy 41.56	255
Chart Question Answering	ChartQA	Accuracy 83.12	165
Multimodal Understanding	MMMU	Accuracy 52.56	107
Visual Question Answering	MMMU	Accuracy 60.92	101
Visual Question Answering	BLINK	Accuracy 48.39	27
Vision-Language Hallucination Assessment	MM-HallBench	Average Score 3.64	8
Open-ended generation	CapArena	Average Score 11.83	7
Text-only QA	MMLU	Accuracy 73.21	7

