Anticipatory Evaluation of Language Models
About
Progress in large language models is increasingly constrained by an evaluation bottleneck: benchmarks must be built and models run before iteration can begin. We investigate whether evaluation outcomes can be forecast before any experiments are conducted. Specifically, we study text-only performance prediction, where models estimate performance from task descriptions and experimental configurations alone, without access to dataset instances. To support systematic study, we curate PRECOG, a corpus of description-performance pairs spanning diverse tasks, domains, and metrics. We scrape task and configuration descriptions from arXiv, yielding 2,290 instances covering 1,519 papers, and construct a test split using papers published after the evaluated models' knowledge cutoff. Experiments show the task is challenging but feasible: reasoning models achieve non-trivial forecasting skill, reaching a mean absolute error as low as 9.9 at high-confidence thresholds. Overall, our corpus and analyses offer an initial step toward open-ended anticipatory evaluation, supporting difficulty estimation and smarter resource allocation.
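The headline number above is a mean absolute error computed only over forecasts that clear a confidence threshold. A minimal sketch of that selective-scoring idea, assuming a hypothetical record format with `predicted`, `actual`, and `confidence` fields (not the paper's actual pipeline):

```python
import statistics

def mae_at_confidence(forecasts, threshold):
    """MAE over forecasts whose self-reported confidence meets the
    threshold, plus the fraction of forecasts retained (coverage)."""
    kept = [f for f in forecasts if f["confidence"] >= threshold]
    if not kept:
        return None, 0.0
    errors = [abs(f["predicted"] - f["actual"]) for f in kept]
    return statistics.mean(errors), len(kept) / len(forecasts)

# toy forecasts of benchmark scores on a 0-100 scale
forecasts = [
    {"predicted": 71.0, "actual": 78.2, "confidence": 0.9},
    {"predicted": 40.0, "actual": 62.5, "confidence": 0.4},
    {"predicted": 55.0, "actual": 51.0, "confidence": 0.8},
]
err, coverage = mae_at_confidence(forecasts, threshold=0.8)
```

Raising the threshold trades coverage for accuracy: fewer predictions are scored, but only the ones the model is most certain about, which is how a low MAE "at high-confidence thresholds" can coexist with a harder overall task.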
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Large Model Performance Prediction | OpenCompass, 95% masking, September 30, 2024 cutoff (temporal split) | RMSE 23.33 | 10 |
| Large Model Performance Prediction | Large Model Performance Prediction, 60% masking | RMSE 23.55 | 10 |
| Large Model Performance Prediction | Large Model Performance Prediction dataset 1.0 (40% masking) | RMSE 23.6 | 10 |
| Performance Prediction | Large Model Performance Prediction Dataset, 80% masking (test) | RMSE 23.5 | 10 |
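The benchmarks above mask a fraction of a score table and evaluate predictions of the hidden cells by RMSE. A minimal sketch of that scoring step, assuming the masked entries have already been flattened into aligned predicted/true lists (the variable names are illustrative, not the benchmarks' API):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error over aligned score lists."""
    assert len(predicted) == len(actual) and predicted
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# toy masked cells: model's filled-in scores vs. the true held-out
# values (0-100 scale), as in the masking-based splits above
pred = [62.0, 48.5, 71.0]
true = [70.0, 45.0, 55.0]
error = rmse(pred, true)
```

Because RMSE squares each residual, a few badly missed cells dominate the score, which makes it a stricter summary than MAE for the same predictions.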