Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs

About

Multimodal Large Language Models (MLLMs) often struggle with fine-grained perception, such as identifying small objects in high-resolution images or detecting key moments in long videos. Existing methods typically rely on complex, task-specific fine-tuning, which reduces generalizability and increases system complexity. In this work, we propose an effective, training-free framework that uses an MLLM's intrinsic uncertainty as proactive guidance. Our core insight is that a model's uncertainty decreases when provided with relevant visual information. We introduce a unified mechanism that scores candidate visual inputs by response uncertainty, enabling the model to autonomously focus on the most informative data. We apply this simple principle to three challenging visual tasks: Visual Search, Long Video Understanding, and Temporal Grounding, allowing off-the-shelf MLLMs to achieve performance competitive with specialized, fine-tuned systems. Our results demonstrate that leveraging intrinsic uncertainty is a powerful strategy for improving fine-grained multimodal performance.
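The selection mechanism described above can be sketched in a few lines: score each candidate visual input (e.g. an image crop or video segment) by the entropy of the model's answer distribution, and keep the candidate the model is most certain about. The snippet below is a minimal illustration, not the paper's implementation; the toy answer distributions stand in for probabilities that would come from an actual MLLM.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_informative(candidates, answer_probs):
    """Pick the candidate visual input whose answer distribution has
    the lowest entropy, i.e. the one the model is most certain about."""
    scored = [(entropy(answer_probs(c)), c) for c in candidates]
    return min(scored)[1]

# Toy stand-in for an MLLM's output: hypothetical answer distributions
# per candidate crop. In practice these would be the model's token
# probabilities when answering the question given each candidate.
toy_probs = {
    "crop_A": [0.34, 0.33, 0.33],  # model is uncertain -> high entropy
    "crop_B": [0.90, 0.05, 0.05],  # model is confident -> low entropy
    "crop_C": [0.50, 0.30, 0.20],
}

best = select_most_informative(toy_probs.keys(), lambda c: toy_probs[c])
print(best)  # crop_B: lowest response uncertainty
```

The same loop applies unchanged whether the candidates are crops of a high-resolution image (Visual Search) or segments of a long video (Temporal Grounding), which is what makes the mechanism task-agnostic.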

Sanghwan Kim, Rui Xiao, Stephan Alaniz, Yongqin Xian, Zeynep Akata • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | V*Bench | Accuracy | 86.9 | 84 |
| Visual Question Answering | HRBench-4K | Accuracy | 0.749 | 54 |
| Visual Question Answering | HRBench-8K | Accuracy | 68.9 | 51 |
| Multimodal Large Language Model Evaluation | MME-RealWorld | Reasoning | 35.1 | 5 |
