ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

About

Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.
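To make the two framings above concrete, here is a minimal sketch (not the authors' implementation) of GP-based Bayesian quadrature and superlevel-set failure probabilities over a finite pool of test inputs. Everything named here is an assumption for illustration: the RBF kernel, the constant `prior_mean` standing in for a pre-trained prior, the 1-D toy pool, the synthetic `true_score` function, the threshold of 0.8, and the maximum-variance selection rule, which is a simple stand-in for the paper's uncertainty-aware decision strategies.

```python
import math
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    """Squared-exponential kernel between the rows of a and b (assumed kernel)."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_posterior(pool, idx, y, prior_mean=0.5, noise=1e-4):
    """Posterior mean and covariance over the whole pool, given scores y at pool[idx]."""
    x_obs = pool[idx]
    k_ss = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(idx))
    k_ps = rbf_kernel(pool, x_obs)
    # One linear solve covers both the mean term and the covariance term.
    sol = np.linalg.solve(k_ss, np.column_stack([y - prior_mean, k_ps.T]))
    mean = prior_mean + k_ps @ sol[:, 0]
    cov = rbf_kernel(pool, pool) - k_ps @ sol[:, 1:]
    return mean, cov

def bq_estimate(mean, cov):
    """Bayesian quadrature for the pool average: posterior mean and variance of (1/n) * sum f(x_i)."""
    n = len(mean)
    return float(mean.mean()), float(max(cov.sum(), 0.0) / n**2)

def failure_probability(mean, cov, threshold):
    """P(f(x) > threshold) per pool point, i.e. membership in the superlevel set."""
    std = np.sqrt(np.maximum(np.diag(cov), 1e-12))
    z = (threshold - mean) / std
    return np.array([0.5 * math.erfc(v / math.sqrt(2.0)) for v in z])

# Toy pool: 50 one-dimensional inputs with a synthetic severity score in [0.1, 0.9].
pool = np.linspace(0.0, 1.0, 50)[:, None]
true_score = 0.5 + 0.4 * np.sin(6.0 * pool[:, 0])

idx = [0]  # start from a single evaluated input
for _ in range(11):
    mean, cov = gp_posterior(pool, idx, true_score[idx])
    # Stand-in acquisition rule: evaluate the input with the largest posterior variance.
    idx.append(int(np.argmax(np.diag(cov))))

mean, cov = gp_posterior(pool, idx, true_score[idx])
estimate, variance = bq_estimate(mean, cov)
p_fail = failure_probability(mean, cov, threshold=0.8)
print(f"BQ estimate {estimate:.4f} vs. pool average {true_score.mean():.4f} "
      f"after {len(idx)} of {len(pool)} evaluations")
```

The quadrature variance gives a stopping signal for performance estimation, while `p_fail` ranks unevaluated inputs by how likely they are to exceed the failure threshold; a practical system would trade these two goals off rather than use pure variance sampling.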

Yizheng Huang, Wenjun Zeng, Aditi Kumaresan, Zi Wang • 2026

Related benchmarks

Task	Dataset	Result
Performance Estimation	GSM8K	MAE 0.00e+0	204
Performance Estimation	Jigsaw	MAE 0.001	198
Performance Estimation	MMLU	MAE 0.002	198
Performance Estimation	SVAMP	MAE 0.00e+0	198
Performance Estimation	ToxicChat	MAE 0.00e+0	198
Performance Estimation	StrategyQA	MAE 0.00e+0	197
Performance Estimation	DIVE	MAE 0.002	189
Performance Estimation	GQA	MAE 0.00e+0	184
Performance Estimation	DICES	MAE 0.00e+0	136
Population property estimation	DICES	Bias (MAE) 0.001	92

Showing 10 of 12 rows
