PRISM: Demystifying Retention and Interaction in Mid-Training

About

We present PRISM, a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we show that mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks while preserving general performance. The full PRISM-to-RL pipeline improves the macro-average across six reasoning benchmarks from under 12 to 29-42 (a 3-4x improvement), whereas RL applied directly to most of the base models remains substantially less effective, with AIME scores near zero. Data composition matters most during mid-training, not during RL: including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix produces differences of less than 2 points. Mechanistically, mid-training densely restructures over 90% of model weights, while RL makes sparse, front-loaded refinements to approximately 5% of parameters. Representation analysis (CKA) confirms that RL consistently preserves mid-training's representational geometry (over 0.998 CKA) across architectures. Crucially, RL applies identical weight changes regardless of starting point, yet only succeeds on mid-trained models, consistent with mid-training placing the model in a configuration from which RL can effectively improve performance. Our results demonstrate that retention-aware mid-training is highly effective for reliable reasoning enhancement, and we provide practical guidance for designing robust mid-training pipelines.
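The abstract's mechanistic claims rest on two simple measurements: linear CKA between hidden representations and the fraction of weights that change between checkpoints. The paper's own implementation is not shown here; the following PyTorch sketch illustrates one common way to compute both quantities (all tensor shapes, tolerances, and variable names are illustrative assumptions, not the authors' code):

```python
import torch


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, hidden_dim).

    Standard formulation: CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    on column-centered features; values near 1.0 mean the two representations
    share the same geometry up to a linear transform.
    """
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    numerator = torch.norm(y.T @ x, p="fro") ** 2
    denominator = torch.norm(x.T @ x, p="fro") * torch.norm(y.T @ y, p="fro")
    return (numerator / denominator).item()


def changed_param_fraction(before: dict, after: dict, tol: float = 1e-6) -> float:
    """Fraction of parameters that moved by more than `tol` between two
    state_dicts with identical keys and shapes (e.g. a base checkpoint vs.
    its mid-trained or RL-tuned counterpart)."""
    changed, total = 0, 0
    for name, p_before in before.items():
        p_after = after[name]
        changed += (p_after - p_before).abs().gt(tol).sum().item()
        total += p_before.numel()
    return changed / total


if __name__ == "__main__":
    # Toy example with random activations; real usage would collect hidden
    # states from the same prompts before and after RL.
    acts_mid = torch.randn(512, 2048)
    acts_rl = acts_mid + 0.01 * torch.randn(512, 2048)  # small perturbation
    print(f"CKA: {linear_cka(acts_mid, acts_rl):.4f}")  # close to 1.0
```

Under this reading, a CKA above 0.998 between mid-trained and RL-tuned activations means RL barely rotates the representation space, while changed_param_fraction distinguishes a dense mid-training update (roughly 90% of weights moved) from a sparse RL update (roughly 5%).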

Bharat Runwal, Ashish Agrawal, Anurag Roy, Rameswar Panda • 2026

Related benchmarks

Task | Dataset | Result | Rank
Mathematical Reasoning | AIME 2024 | -- | 104
Scientific Reasoning | GPQA Diamond | Score: 34.34 | 68
Mathematical Problem Solving | AIME 2024 | -- | 62
Science Reasoning | GPQA | GPQA Score: 52.86 | 27
Code Generation | LiveCodeBench (LCB) | LCB Score: 20.79 | 21
Code Generation | Codeforces (CF) | CF Score: 20.46 | 21
General Reasoning | Aggregated Evaluation Suite (Coding, Math, Science) | Code Average: 20.38 | 21
Mathematical Problem Solving | MATH500 | MATH500 Score: 85.88 | 21
Mathematical Reasoning | AIME 2025 | AIME25 Score: 27.96 | 16
Coding | LiveCodeBench | LCB Score: 15.53 | 14

(Showing 10 of 14 rows.)
