Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

About

Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized, and only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model's KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks (e.g., +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 using Ling-mini-2.0) while remaining highly efficient.

Zhen Yang, Mingyang Zhang, Feng Chen, Ganggui Ding, Liang Hou, Xin Tao, Ying-Cong Chen • 2025
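
The selective intervention described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the entropy threshold of 1.0, and the guidance scale of 1.5 are hypothetical values chosen for illustration, and the guidance step uses the standard classifier-free guidance formula. The KV-cache reuse for the negative prompt is not modeled here; the negative-prompt logits are simply assumed to be given.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    p = softmax(logits)
    return float(-np.sum(p * np.log(p + 1e-12)))

def selective_cfg(cond_logits, neg_logits,
                  entropy_threshold=1.0,   # hypothetical value, not from the paper
                  guidance_scale=1.5):     # hypothetical value, not from the paper
    """Apply classifier-free guidance only when the token is uncertain.

    Returns (logits, intervened). Low-entropy positions pass through
    unchanged; high-entropy positions are pushed away from the
    negative-prompt logits using the standard CFG combination.
    """
    if token_entropy(cond_logits) <= entropy_threshold:
        return cond_logits, False
    guided = neg_logits + guidance_scale * (cond_logits - neg_logits)
    return guided, True

# A confident position: one token dominates, so no intervention is applied.
confident = np.array([10.0, 0.0, 0.0, 0.0])
_, touched = selective_cfg(confident, np.zeros(4))
print(touched)

# An uncertain position: a uniform distribution has entropy ln(4) > 1.0,
# so guidance is applied.
uncertain = np.ones(4)
guided, touched = selective_cfg(uncertain, np.array([0.0, 1.0, 1.0, 1.0]))
print(touched, guided)
```

In this sketch the gate is a fixed per-token entropy threshold, so the extra negative-prompt computation is only needed at the positions that trip it; how the threshold is actually chosen is left to the paper.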

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	BRUMO25	--	62
Mathematical Reasoning	AIME 25	Mean@8 Accuracy 73.95	21
Mathematical Reasoning	HMMT25	Avg@8 Score 53.33	20
Mathematical Reasoning	AIME 24	Mean@8 Accuracy 81.67	9

