Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning
About
Achieving effective test-time scaling requires models to engage in In-Context Exploration -- the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration, a simple yet effective recipe that explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that our method effectively incentivizes in-context exploration. As a result, it achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
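The abstract does not spell out the exact reward formulation, so the snippet below is only a minimal, hypothetical sketch of how a length-based reward and a redundancy penalty could be combined with a correctness signal during RL fine-tuning. The function names, coefficients (`length_bonus_coef`, `redundancy_penalty_coef`, `max_length`), and the n-gram redundancy measure are illustrative assumptions, not the paper's actual recipe.

```python
from collections import Counter


def ngram_redundancy(tokens, n=4):
    """Fraction of n-grams that are repeats; a rough proxy for redundant text.

    This measure is an assumption for illustration, not the paper's definition.
    """
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def length_incentivized_reward(tokens, is_correct,
                               length_bonus_coef=1e-3,
                               max_length=8192,
                               redundancy_penalty_coef=1.0):
    """Hypothetical shaped reward: correctness + length bonus - redundancy penalty.

    Longer trajectories earn a (capped) bonus, nudging the policy toward covering
    more states in-context, while the redundancy penalty discourages padding the
    trajectory with repeated content just to collect the length bonus.
    """
    correctness = 1.0 if is_correct else 0.0
    length_bonus = length_bonus_coef * min(len(tokens), max_length)
    redundancy_penalty = redundancy_penalty_coef * ngram_redundancy(tokens)
    return correctness + length_bonus - redundancy_penalty


if __name__ == "__main__":
    trajectory = "let me verify this result with a second approach before answering".split()
    print(length_incentivized_reward(trajectory, is_correct=True))
```

In such a scheme, the cap on the length bonus and the redundancy penalty together would act as the guard rails that keep "think longer" from degenerating into "repeat yourself"; the actual coefficients and redundancy definition used by the method may differ.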
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH | Accuracy | 88.8 | 535 |
| Mathematical Reasoning | AIME | Accuracy | 30.5 | 283 |
| Mathematical Reasoning | AIME 25 | Accuracy | 26.7 | 201 |
| Mathematical Reasoning | AMC | Accuracy | 66.2 | 151 |
| Multitask Language Understanding | MMLU-Pro | Accuracy | 63.8 | 99 |
| Mathematical Reasoning | Olympiad | Accuracy | 57.2 | 92 |
| Scientific Reasoning | ARC Challenge | Accuracy | 91.5 | 56 |
| Reasoning | Out-of-Domain Reasoning Suite | ARC-c Score | 94.5 | 29 |
| Mathematical Reasoning | In-Domain Reasoning Suite | MATH Score | 91.4 | 9 |
| General Reasoning | GPQA | Accuracy | 47.5 | 7 |