Anticipatory Evaluation of Language Models
About
Progress in large language models is increasingly constrained by an evaluation bottleneck: benchmarks must be built and models run before iteration can begin. We investigate whether evaluation outcomes can be forecast before any experiments are conducted. Specifically, we study text-only performance prediction, where models estimate performance from task descriptions and experimental configurations alone, without access to dataset instances. To support systematic study, we curate PRECOG, a corpus of description-performance pairs spanning diverse tasks, domains, and metrics. We scrape task and configuration descriptions from arXiv, yielding 2,290 instances covering 1,519 papers, and construct a test split using papers published after the evaluated models' knowledge cutoff. Experiments show the task is challenging but feasible: reasoning models achieve non-trivial forecasting skill, reaching a mean absolute error as low as 9.9 at high-confidence thresholds. Overall, our corpus and analyses offer an initial step toward open-ended anticipatory evaluation, supporting difficulty estimation and smarter resource allocation.
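The headline number above is a mean absolute error computed only over forecasts that clear a confidence threshold. A minimal sketch of that selective-scoring idea, assuming a hypothetical record format with `predicted`, `actual`, and `confidence` fields (not the paper's actual pipeline):

```python
import statistics

def mae_at_confidence(forecasts, threshold):
    """MAE over forecasts whose self-reported confidence meets the
    threshold, plus the fraction of forecasts retained (coverage)."""
    kept = [f for f in forecasts if f["confidence"] >= threshold]
    if not kept:
        return None, 0.0
    errors = [abs(f["predicted"] - f["actual"]) for f in kept]
    return statistics.mean(errors), len(kept) / len(forecasts)

# toy forecasts of benchmark scores on a 0-100 scale
forecasts = [
    {"predicted": 71.0, "actual": 78.2, "confidence": 0.9},
    {"predicted": 40.0, "actual": 62.5, "confidence": 0.4},
    {"predicted": 55.0, "actual": 51.0, "confidence": 0.8},
]
err, coverage = mae_at_confidence(forecasts, threshold=0.8)
```

Raising the threshold trades coverage for accuracy: fewer predictions are scored, but only the ones the model is most certain about, which is how a low MAE "at high-confidence thresholds" can coexist with a harder overall task.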
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Large Model Performance Prediction | OpenCompass, 95% masking, September 30, 2024 cutoff (temporal split) | RMSE 23.33 | 10 |
| Large Model Performance Prediction | Large Model Performance Prediction, 60% masking | RMSE 23.55 | 10 |
| Large Model Performance Prediction | Large Model Performance Prediction dataset 1.0 (40% masking) | RMSE 23.6 | 10 |
| Performance Prediction | Large Model Performance Prediction Dataset, 80% masking (test) | RMSE 23.5 | 10 |
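The benchmarks above mask a fraction of a score table and evaluate predictions of the hidden cells by RMSE. A minimal sketch of that scoring step, assuming the masked entries have already been flattened into aligned predicted/true lists (the variable names are illustrative, not the benchmarks' API):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error over aligned score lists."""
    assert len(predicted) == len(actual) and predicted
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# toy masked cells: model's filled-in scores vs. the true held-out
# values (0-100 scale), as in the masking-based splits above
pred = [62.0, 48.5, 71.0]
true = [70.0, 45.0, 55.0]
error = rmse(pred, true)
```

Because RMSE squares each residual, a few badly missed cells dominate the score, which makes it a stricter summary than MAE for the same predictions.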