
Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

About

Achieving effective test-time scaling requires models to engage in In-Context Exploration: the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration, a simple yet effective recipe that explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that our method effectively incentivizes in-context exploration. As a result, it achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
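The abstract describes shaping the reinforcement-learning reward with a length bonus and a redundancy penalty. As a rough illustration of what such a shaped reward could look like, here is a minimal Python sketch; the n-gram repetition proxy, the linear length bonus, and all weights are illustrative assumptions, not the paper's actual formulation.

```python
from collections import Counter


def repetition_ratio(token_ids, n: int = 4) -> float:
    """Fraction of repeated n-grams in a trajectory (a simple redundancy proxy; assumed)."""
    if len(token_ids) < n:
        return 0.0
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def length_incentivized_reward(
    token_ids,
    is_correct: bool,
    max_len: int = 8192,        # assumed context budget
    length_weight: float = 0.5,  # assumed bonus weight
    redundancy_weight: float = 1.0,  # assumed penalty weight
) -> float:
    """Task reward + length bonus - redundancy penalty (illustrative only)."""
    task_reward = 1.0 if is_correct else 0.0
    # Encourage longer reasoning trajectories, capped at the context budget.
    length_bonus = length_weight * min(len(token_ids) / max_len, 1.0)
    # Discourage padding the trajectory with repeated content.
    redundancy_penalty = redundancy_weight * repetition_ratio(token_ids)
    return task_reward + length_bonus - redundancy_penalty
```

In this sketch, the length bonus pushes the policy toward longer trajectories (countering the exponential decay described above), while the redundancy penalty prevents the model from gaming the bonus by repeating itself.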

Futing Wang, Jianhao Yan, Yun Luo, Ganqu Cui, Zhi Wang, Xiaoye Qu, Yue Zhang, Yu Cheng, Tao Lin • 2026

Related benchmarks

Task | Dataset | Result | Rank
Mathematical Reasoning | MATH | Accuracy 88.8 | 535
Mathematical Reasoning | AIME | AIME Accuracy 30.5 | 283
Mathematical Reasoning | AIME 25 | Accuracy 26.7 | 201
Mathematical Reasoning | AMC | Accuracy 66.2 | 151
Multitask Language Understanding | MMLU-Pro | Accuracy 63.8 | 99
Mathematical Reasoning | Olympiad | Accuracy 57.2 | 92
Scientific Reasoning | ARC Challenge | Accuracy 91.5 | 56
Reasoning | Out-of-Domain Reasoning Suite | ARC-c Score 94.5 | 29
Mathematical Reasoning | In-Domain Reasoning Suite | MATH Score 91.4 | 9
General Reasoning | GPQA | Accuracy 47.5 | 7

Other info

GitHub
