Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

About

Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop Nemotron-Cascade, capable of operating in both instruct and deep thinking modes, without any performance gap relative to a thinking-only counterpart. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.

Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai, Yang Chen, Yangyi Chen, Zhuolin Yang, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping• 2025

Related benchmarks

Task	Dataset	Result
Graduate-level Science Reasoning	GPQA	Accuracy68.4	121
Knowledge Reasoning	MMLU-Pro	--	120
Competitive Programming	LiveCodeBench Pro 25Q2	Easy Score77.6	33
Competitive Programming	LiveCodeBench Pro 25Q1	Easy Score75.8	33
Competitive Programming	Codeforces 2501 - 2507	ELO2.12e+3	32
Alignment	IFEval strict prompt	pass@190.2	26
Competitive Programming	LiveCodeBench v5	Score77.5	22
Competitive Programming	LiveCodeBench 2408 - 2505 v6	Score74.6	19
Competitive Programming	LiveCodeBench 2408 - 2505 v6	Pass@178.7	15
Code Generation	LCB 2408-2505	Pass@174.6	11

Showing 10 of 24 rows

Other info

Follow for update

@wizwand_team Discord