STAR: Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction

About

As comprehensive large model evaluation becomes prohibitively expensive, predicting model performance from limited observations has become essential. However, existing statistical methods struggle with pattern shifts and data sparsity and offer no explanations, while pure LLM methods remain unreliable. We propose STAR, a framework that bridges data-driven STatistical expectations with knowledge-driven Agentic Reasoning. STAR leverages specialized retrievers to gather external knowledge and embeds semantic features into Constrained Probabilistic Matrix Factorization (CPMF) to generate statistical expectations with uncertainty. A reasoning module guided by Expectation Violation Theory (EVT) then refines predictions through intra-family analysis, cross-model comparison, and credibility-aware aggregation, producing adjustments with traceable explanations. Extensive experiments show that STAR consistently outperforms all baselines on both score-based and rank-based metrics, delivering a 14.46% gain in total score over the strongest statistical method under extreme sparsity, with only 1 to 2 observed scores per test model.
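To make the statistical half of the abstract concrete, here is a minimal sketch of plain probabilistic matrix factorization fitted by SGD on a sparse model-by-benchmark score matrix. It is not the paper's CPMF: it uses no semantic features, produces no uncertainty estimate, and has no agentic refinement step. The function names, hyperparameters, and toy scores are all hypothetical.

```python
import math
import random

def factorize(observed, n_models, n_benchmarks, rank=2, lr=0.1, reg=0.01,
              epochs=500, seed=0):
    """Fit latent factors U (models) and V (benchmarks) to observed scores.

    `observed` maps (model_index, benchmark_index) -> score in [0, 1].
    Plain L2-regularized matrix factorization; an illustrative stand-in only.
    """
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_models)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_benchmarks)]
    for _ in range(epochs):
        for (i, j), score in observed.items():
            err = score - predict(U, V, i, j)
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                # Gradient step on squared error with L2 regularization.
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)
    return U, V

def predict(U, V, i, j):
    """Predicted score: inner product of model i and benchmark j factors."""
    return sum(u * v for u, v in zip(U[i], V[j]))

def rmse_on(entries, U, V):
    """RMSE of predictions over a dict of (i, j) -> true score."""
    sq = [(s - predict(U, V, i, j)) ** 2 for (i, j), s in entries.items()]
    return math.sqrt(sum(sq) / len(sq))

# Toy data (made up): 4 models x 3 benchmarks, several entries masked.
observed = {(0, 0): 0.9, (0, 1): 0.8, (0, 2): 0.7,
            (1, 0): 0.6, (1, 1): 0.5,
            (2, 0): 0.4, (2, 2): 0.3,
            (3, 1): 0.3}

U, V = factorize(observed, n_models=4, n_benchmarks=3)
print("fit RMSE on observed entries:", rmse_on(observed, U, V))
print("predicted score for masked entry (1, 2):", predict(U, V, 1, 2))
```

In this sketch the masked entry is filled in purely from the latent factors; the abstract describes STAR as additionally adjusting such expectations with retrieved knowledge and reasoning.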

Xiaoxiao Wang, Chunxiao Li, Junying Wang, Yijin Guo, Zijian Chen, Chunyi Li, Xiaohong Liu, Zicheng Zhang, Guangtao Zhai • 2026

Related benchmarks

Task	Dataset	Result
Large Model Performance Prediction	OpenCompass 95% masking September 30, 2024 cutoff (temporal split)	RMSE 8.75	10
Large Model Performance Prediction	Large Model Performance Prediction 60% masking	RMSE 6.77	10
Large Model Performance Prediction	Large Model Performance Prediction dataset 1.0 (40% masking)	RMSE 6.13	10
Performance Prediction	Large Model Performance Prediction Dataset 80% masking (test)	RMSE 7.5	10
Large Model Performance Prediction	Benchmark-side Pattern Shift Math	Average Score 13.01	6
Large Model Performance Prediction	285 models on one Math benchmark	Top-10 Recall 82	5
Large Model Performance Prediction	Architecture pattern shift MoE	RMSE 10.68	3
Large Model Performance Prediction	Paradigm RLHF pattern shift	RMSE 9.55	3
Large Model Performance Prediction	Frontier Top-20 pattern shift	RMSE 9.71	3
Large Model Performance Prediction	Benchmark OCR pattern shift	RMSE 25.18	3

Showing 10 of 16 rows
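For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute RMSE and top-k recall for predicted versus actual benchmark scores. The page does not define either metric, so these are standard textbook definitions taken as an assumption; the function names and toy scores are hypothetical.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length score lists."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

def top_k_recall(predicted, actual, k):
    """Fraction of the true top-k models that also appear in the predicted top-k.

    `predicted` and `actual` map model name -> score (higher is better).
    """
    true_top = set(sorted(actual, key=actual.get, reverse=True)[:k])
    pred_top = set(sorted(predicted, key=predicted.get, reverse=True)[:k])
    return len(true_top & pred_top) / len(true_top)

# Toy scores (made up) for four models on a single benchmark.
actual = {"a": 90, "b": 80, "c": 70, "d": 60}
predicted = {"a": 85, "b": 60, "c": 75, "d": 65}

print(rmse([predicted[m] for m in actual], list(actual.values())))
print(top_k_recall(predicted, actual, k=2))  # true top-2 {a, b} vs predicted {a, c} -> 0.5
```

Lower RMSE is better and higher recall is better; a table value such as "Top-10 Recall 82" is presumably this fraction expressed as a percentage, though the page does not say so.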
