MUR: Momentum Uncertainty guided Reasoning for Large Language Models
About
Large Language Models have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide the test-time scaling of current models without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proofs supporting the superiority of MUR in terms of stability and bias. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 45% on average while improving accuracy by 0.33% to 3.46%.
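To make the idea of tracking and aggregating stepwise uncertainty more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the update rule or the trigger condition, so the exponential moving average, the threshold `momentum / gamma`, and the names `flag_uncertain_steps`, `alpha`, and `gamma` are all assumptions chosen for illustration. The per-step uncertainty values are taken as given (for example, an average negative log-likelihood per reasoning step).

```python
from typing import List, Optional


def flag_uncertain_steps(
    uncertainties: List[float],
    alpha: float = 0.9,   # assumed smoothing factor for the running average
    gamma: float = 0.9,   # assumed budget knob in (0, 1]; smaller flags fewer steps
) -> List[bool]:
    """Return one flag per step: True if the step might deserve extra compute.

    Hypothetical rule (not taken from the paper): a step is flagged when its
    uncertainty exceeds the running "momentum" of earlier steps divided by gamma.
    """
    flags: List[bool] = []
    momentum: Optional[float] = None
    for u in uncertainties:
        if momentum is None:
            # No history yet: never flag the first step, just seed the momentum.
            flags.append(False)
            momentum = u
        else:
            # Compare against the momentum accumulated from previous steps only.
            flags.append(u > momentum / gamma)
            # Exponential moving average: old history decays, new step is blended in.
            momentum = alpha * momentum + (1.0 - alpha) * u
    return flags


if __name__ == "__main__":
    # Toy uncertainties for four reasoning steps; the third step spikes.
    print(flag_uncertain_steps([1.0, 1.0, 2.0, 1.0], alpha=0.5, gamma=0.9))
```

In this sketch only the flagged steps would be handed to a more expensive test-time scaling routine, which is one way a single hyperparameter could trade accuracy against token usage. Lowering `gamma` raises the threshold and so spends less extra compute.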
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy (Acc) | 84.4 | 543 |
| Mathematical Reasoning | AIME 24 | Accuracy | 73.33 | 318 |
| Mathematical Reasoning | AIME 2024 (test) | Accuracy | 53.33 | 209 |
| Mathematical Reasoning | AIME 2025 (test) | -- | -- | 148 |
| Mathematical Reasoning | AIME24 | Pass@1 Accuracy | 36.67 | 117 |
| Mathematical Reasoning | MATH 500 | Accuracy | 94 | 79 |
| Scientific Reasoning | GPQA Diamond | Latency | 7.29 | 54 |
| Mathematical Problem Solving | MATH | Average Time | 4.49 | 39 |
| Mathematical Problem Solving | AIME 25 | Average Time | 19.62 | 39 |
| Mathematical Problem Solving | AIME 24 | Average Time | 23.24 | 39 |