On Predicting the Post-training Potential of Pre-trained LLMs

About

The performance of Large Language Models (LLMs) on downstream tasks is fundamentally constrained by the capabilities acquired during pre-training. However, traditional benchmarks like MMLU often fail to reflect a base model's plasticity in complex open-ended scenarios, leading to inefficient model selection. We address this by introducing a new task, predicting post-training potential: forecasting a base model's post-training performance before any post-training is done. We propose RuDE (Rubric-based Discriminative Evaluation), a unified framework that bypasses the generation gap of base models by leveraging response discrimination. Guided by our systematic 4C Taxonomy, RuDE constructs controlled contrastive pairs across diverse domains through fine-grained rubric violations. Extensive experiments demonstrate a correlation greater than 90% with post-training performance. Crucially, validation via Reinforcement Learning (RL) confirms that RuDE effectively identifies high-potential smaller models that outperform larger counterparts, offering a compute-efficient mechanism for foundation model development.
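The abstract does not spell out how a base model's discrimination is scored. As a minimal sketch, assume each contrastive pair carries two scores from the base model (for example, log-likelihoods), one for the rubric-compliant response and one for the rubric-violating response, and that the model is counted as correct when it scores the compliant response higher. The names `ContrastivePair` and `discrimination_accuracy` and all numbers are hypothetical placeholders, not part of RuDE and not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class ContrastivePair:
    """Hypothetical evaluation item: two responses to the same prompt."""
    compliant_score: float  # assumed model score for the rubric-compliant response
    violating_score: float  # assumed model score for the rubric-violating response


def discrimination_accuracy(pairs):
    """Return the fraction of pairs where the compliant response scores strictly higher."""
    if not pairs:
        raise ValueError("need at least one contrastive pair")
    correct = sum(p.compliant_score > p.violating_score for p in pairs)
    return correct / len(pairs)


# Placeholder scores for illustration only.
toy_pairs = [
    ContrastivePair(-12.0, -15.5),  # compliant preferred
    ContrastivePair(-30.2, -28.9),  # violating preferred (counted as an error)
    ContrastivePair(-8.4, -9.1),    # compliant preferred
    ContrastivePair(-21.0, -25.0),  # compliant preferred
]

print(discrimination_accuracy(toy_pairs))  # 3 of 4 pairs correct -> 0.75
```

Because the score only requires ranking two given responses, it can be computed on a base model that cannot yet follow instructions, which is the sense in which discrimination sidesteps the generation gap.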

Xiaoyuan Li, Yubo Ma, Kexin Yang, Moxin Li, Keqin Bao, Wenie Wang, Fuli Feng, Dayiheng Liu • 2026

Related benchmarks

Task	Dataset	Result
Discriminative Evaluation	RuDE base	--	16
Generative Performance	AdvancedIF	Pearson r = 0.91	1
Generative Performance	HealthBench	Pearson r = 0.67	1
Generative Performance	WritingBench	Pearson r = 0.62	1
Generative Performance	PRBench	Pearson r = 0.8	1
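The Pearson r values in the table measure linear agreement between two per-model quantities. As a hedged illustration, assuming one list of discriminative scores for a set of base models and one list of the same models' generative benchmark scores after post-training, the coefficient can be computed as below. The function name `pearson_r` and the sample numbers are placeholders chosen for this sketch and do not come from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values: one entry per hypothetical base model.
discriminative_scores = [0.55, 0.61, 0.68, 0.74]
post_training_scores = [31.0, 38.5, 41.0, 52.0]

print(round(pearson_r(discriminative_scores, post_training_scores), 3))
```

A value near 1 would indicate that models ranked higher by the discriminative score also tend to score higher after post-training; with only a handful of models, such an estimate is noisy and should be read with that in mind.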
