PRISM: Demystifying Retention and Interaction in Mid-Training
About
We present PRISM, a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we show that mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks while preserving general performance. The full PRISM-to-RL pipeline improves the macro-average across six reasoning benchmarks from under 12 to 29-42 (a 3-4x improvement), whereas RL applied directly to most of the base models remains substantially less effective, with AIME scores near zero. Data composition matters most at mid-training, not RL: including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix produces differences of less than 2 points. Mechanistically, mid-training densely restructures over 90% of model weights, while RL makes sparse, front-loaded refinements to approximately 5% of parameters. Representation analysis (CKA) confirms that RL consistently preserves mid-training's representational geometry (over 0.998 CKA) across architectures. Crucially, RL applies near-identical weight changes regardless of starting point, yet only succeeds on mid-trained models, consistent with mid-training placing the model in a configuration from which RL can effectively improve performance. Our results demonstrate that retention-aware mid-training is highly effective for reliable reasoning enhancement and provide practical guidance for designing robust mid-training pipelines.
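To make the two mechanistic diagnostics mentioned above concrete, the sketch below shows, under our own assumptions, how linear CKA between layer activations of two checkpoints and the fraction of parameters changed between checkpoints could be computed. This is not the authors' code; all function and variable names are illustrative.

```python
# Minimal sketch (not the PRISM implementation) of the two diagnostics
# described in the abstract: linear CKA between activations of two
# checkpoints, and the fraction of weights that move between checkpoints.
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, dim).

    Values near 1.0 indicate the two representations share the same
    geometry up to rotation and scaling.
    """
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(y.T @ x, ord="fro") ** 2
    den = np.linalg.norm(x.T @ x, ord="fro") * np.linalg.norm(y.T @ y, ord="fro")
    return float(num / den)


def changed_fraction(w_before: np.ndarray, w_after: np.ndarray,
                     tol: float = 1e-6) -> float:
    """Fraction of weights whose absolute change exceeds `tol`."""
    return float(np.mean(np.abs(w_after - w_before) > tol))


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Toy activations standing in for a layer's hidden states before/after RL.
    acts_mid = rng.normal(size=(256, 64))
    acts_rl = acts_mid + 0.01 * rng.normal(size=(256, 64))  # small perturbation
    print("CKA:", linear_cka(acts_mid, acts_rl))            # close to 1.0

    # Toy weight matrices standing in for a checkpoint delta under sparse RL updates.
    w_mid = rng.normal(size=(64, 64))
    w_rl = w_mid.copy()
    idx = rng.choice(w_rl.size, size=int(0.05 * w_rl.size), replace=False)
    w_rl.flat[idx] += 0.1                                    # touch ~5% of entries
    print("changed fraction:", changed_fraction(w_mid, w_rl))  # ~0.05
```

In practice, activations and weight tensors would be extracted per layer from the mid-trained and RL checkpoints; the toy arrays above only illustrate the expected pattern (CKA near 1.0, a small changed-parameter fraction) that the abstract reports.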
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | AIME 2024 | -- | 104 |
| Scientific Reasoning | GPQA Diamond | 34.34 | 68 |
| Mathematical Problem Solving | AIME 2024 | -- | 62 |
| Science Reasoning | GPQA | 52.86 | 27 |
| Code Generation | LiveCodeBench (LCB) | 20.79 | 21 |
| Code Generation | Codeforces (CF) | 20.46 | 21 |
| General Reasoning | Aggregated Evaluation Suite (Coding, Math, Science) | 20.38 (Code Average) | 21 |
| Mathematical Problem Solving | MATH500 | 85.88 | 21 |
| Mathematical Reasoning | AIME 2025 | 27.96 | 16 |
| Coding | LiveCodeBench | 15.53 | 14 |