Predicting LLM Reasoning Performance with Small Proxy Model

About

Given the prohibitive cost of pre-training large language models, it is essential to leverage smaller proxy models to optimize datasets before scaling up. However, this approach becomes challenging for reasoning capabilities, which exhibit emergent behavior that appears reliably only at larger model sizes, often exceeding 7B parameters. To address this, we introduce rBridge, showing that small proxies ($\leq$1B) can effectively predict large-model reasoning by aligning more closely with (1) the pre-training objective and (2) the target task. rBridge achieves this by weighting negative log-likelihood with task alignment, using reasoning traces from frontier models as gold labels. In our experiments, rBridge (i) reduces dataset ranking costs by over 100x relative to the best baseline, (ii) achieves the strongest correlation across six reasoning benchmarks at 1B to 32B scale, and (iii) zero-shot transfers predictive relationships across pre-training datasets at 1B to 7B scale. These findings indicate that rBridge offers a practical path for exploring reasoning-oriented pre-training at lower cost.

Woosung Koh, Juyoung Suk, Sungjun Han, Se-Young Yun, Jamin Shin • 2025
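
The abstract describes the rBridge signal as a negative log-likelihood weighted by task alignment and computed on gold reasoning traces. The abstract does not give the exact weighting scheme, so the following is only a minimal sketch under stated assumptions: each token of a gold trace carries a non-negative alignment weight, and the score is the weight-normalized mean NLL that the proxy model assigns to those tokens. The function name `weighted_nll`, the weights, and the log-probabilities are all hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def weighted_nll(token_logprobs: Sequence[float], weights: Sequence[float]) -> float:
    """Weight-normalized negative log-likelihood over one gold reasoning trace.

    token_logprobs: log-probability the proxy model assigns to each gold token.
    weights: assumed per-token task-alignment weights (non-negative).
    """
    if len(token_logprobs) != len(weights):
        raise ValueError("token_logprobs and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(w * lp for w, lp in zip(weights, token_logprobs))
    return -weighted_sum / total_weight


# Hypothetical proxy log-probabilities for a three-token gold trace.
proxy_logprobs = [math.log(0.5), math.log(0.25), math.log(0.125)]

# Uniform weights recover the ordinary mean NLL.
uniform = weighted_nll(proxy_logprobs, [1.0, 1.0, 1.0])

# Putting all weight on the first token scores only that token.
task_weighted = weighted_nll(proxy_logprobs, [1.0, 0.0, 0.0])

print(f"uniform NLL: {uniform:.4f}")
print(f"task-weighted NLL: {task_weighted:.4f}")
```

With uniform weights the score reduces to plain mean NLL, which makes the weighting the only difference between this sketch and a standard language-modeling loss.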

Related benchmarks

Task	Dataset	Metric	Value
Performance Estimation	GSM8K	MAE	1.751	204
Reasoning	ARC-C	Accuracy (ARC-c)	52.254	113
Performance Prediction	Reasoning Benchmarks Average	Train R^2	0.874	21
Performance Prediction	MATH500	R^2 (Train)	0.89	7
Performance Prediction	ARC-C	Train R^2	0.969	7
Performance Prediction	CQA	Train R^2	0.89	7
Performance Prediction	MMLU Pro STEM	Train R-squared	0.897	6
Performance Prediction	HumanEval	Train R^2	0.652	6
Model Ranking Prediction	GPQA	Spearman's Rho	0.37	6
Model Ranking Prediction	SuperGPQA	Spearman Rank Correlation (rho)	0.55	6

Showing 10 of 18 rows
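
The table reports three kinds of metrics: R^2, MAE, and Spearman's rho. As a rough illustration of how such numbers relate a proxy score to large-model accuracy, here is a small standard-library sketch. It assumes that R^2 and MAE are measured against a least-squares line fitted from proxy score to accuracy, which this page does not confirm. The data points are invented for the example and are not results from the paper.

```python
from typing import List, Sequence, Tuple


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def linear_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept for y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented example: proxy score (lower is better) vs. large-model accuracy (%).
proxy_score = [1.0, 2.0, 3.0, 4.0]
accuracy = [40.0, 31.0, 19.0, 10.0]

slope, intercept = linear_fit(proxy_score, accuracy)
predicted = [slope * s + intercept for s in proxy_score]

r2 = r_squared(accuracy, predicted)
mae_value = mae(accuracy, predicted)
rho = spearman(proxy_score, accuracy)

print(f"R^2={r2:.4f}  MAE={mae_value:.3f}  Spearman rho={rho:.2f}")
```

R^2 and MAE measure how well accuracy values are predicted, while Spearman's rho only checks whether the ordering is preserved, which is why the ranking rows in the table use it.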
