Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning
About
Achieving effective test-time scaling requires models to engage in In-Context Exploration -- the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration, a simple yet effective recipe that explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that our method effectively incentivizes in-context exploration. As a result, it achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
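The abstract does not spell out the exact reward formulation, so the snippet below is only a minimal, hypothetical sketch of how a length-based reward and a redundancy penalty could be combined with a correctness signal during RL fine-tuning. The function names, coefficients (`length_bonus_coef`, `redundancy_penalty_coef`, `max_length`), and the n-gram redundancy measure are illustrative assumptions, not the paper's actual recipe.

```python
from collections import Counter


def ngram_redundancy(tokens, n=4):
    """Fraction of n-grams that are repeats; a rough proxy for redundant text.

    This measure is an assumption for illustration, not the paper's definition.
    """
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)


def length_incentivized_reward(tokens, is_correct,
                               length_bonus_coef=1e-3,
                               max_length=8192,
                               redundancy_penalty_coef=1.0):
    """Hypothetical shaped reward: correctness + length bonus - redundancy penalty.

    Longer trajectories earn a (capped) bonus, nudging the policy toward covering
    more states in-context, while the redundancy penalty discourages padding the
    trajectory with repeated content just to collect the length bonus.
    """
    correctness = 1.0 if is_correct else 0.0
    length_bonus = length_bonus_coef * min(len(tokens), max_length)
    redundancy_penalty = redundancy_penalty_coef * ngram_redundancy(tokens)
    return correctness + length_bonus - redundancy_penalty


if __name__ == "__main__":
    trajectory = "let me verify this result with a second approach before answering".split()
    print(length_incentivized_reward(trajectory, is_correct=True))
```

In such a scheme, the cap on the length bonus and the redundancy penalty together would act as the guard rails that keep "think longer" from degenerating into "repeat yourself"; the actual coefficients and redundancy definition used by the method may differ.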
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH | Accuracy | 88.8 | 535 |
| Mathematical Reasoning | AIME | Accuracy | 30.5 | 283 |
| Mathematical Reasoning | AIME 25 | Accuracy | 26.7 | 201 |
| Mathematical Reasoning | AMC | Accuracy | 66.2 | 151 |
| Multitask Language Understanding | MMLU-Pro | Accuracy | 63.8 | 99 |
| Mathematical Reasoning | Olympiad | Accuracy | 57.2 | 92 |
| Scientific Reasoning | ARC Challenge | Accuracy | 91.5 | 56 |
| Reasoning | Out-of-Domain Reasoning Suite | ARC-c Score | 94.5 | 29 |
| Mathematical Reasoning | In-Domain Reasoning Suite | MATH Score | 91.4 | 9 |
| General Reasoning | GPQA | Accuracy | 47.5 | 7 |