
Anticipatory Evaluation of Language Models

About

Progress in large language models is increasingly constrained by an evaluation bottleneck: benchmarks must be built and models run before iteration can begin. We investigate whether evaluation outcomes can be forecast before any experiments are conducted. Specifically, we study text-only performance prediction, where models estimate performance from task descriptions and experimental configurations alone, without access to dataset instances. To support systematic study, we curate PRECOG, a corpus of description-performance pairs spanning diverse tasks, domains, and metrics. We scrape task and configuration descriptions from arXiv, yielding 2,290 instances covering 1,519 papers, and construct a test split from papers published after the evaluated models' knowledge cutoff. Experiments show the task is challenging but feasible: reasoning models achieve non-trivial forecasting skill, reaching a mean absolute error as low as 9.9 at high-confidence thresholds. Overall, our corpus and analyses offer an initial step toward open-ended anticipatory evaluation, supporting difficulty estimation and smarter resource allocation.
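The "mean absolute error at high-confidence thresholds" result can be made concrete with a small sketch of selective forecasting evaluation. This is not the authors' code: the record fields and the self-reported confidence mechanism are assumptions for illustration. The idea is to score only the forecasts a model is confident about, trading coverage for accuracy.

```python
# Hedged sketch (not the PRECOG evaluation code): mean absolute error
# over forecasts filtered by a self-reported confidence threshold.
# Field names ("predicted", "actual", "confidence") are assumptions.

def mae_at_confidence(forecasts, threshold):
    """Return (MAE, coverage) over forecasts whose confidence
    meets the threshold; (None, 0.0) if nothing qualifies."""
    kept = [f for f in forecasts if f["confidence"] >= threshold]
    if not kept:
        return None, 0.0
    mae = sum(abs(f["predicted"] - f["actual"]) for f in kept) / len(kept)
    coverage = len(kept) / len(forecasts)
    return mae, coverage

# Toy example: each record is one forecast of a benchmark score (0-100).
forecasts = [
    {"predicted": 71.0, "actual": 65.2, "confidence": 0.9},
    {"predicted": 40.0, "actual": 62.0, "confidence": 0.3},
    {"predicted": 55.5, "actual": 51.0, "confidence": 0.8},
]

mae_all, cov_all = mae_at_confidence(forecasts, 0.0)
mae_hi, cov_hi = mae_at_confidence(forecasts, 0.7)
```

On this toy data, restricting to high-confidence forecasts drops the low-confidence outlier and lowers MAE at the cost of coverage, which is the trade-off the abstract's "at high-confidence thresholds" qualifier refers to.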

Jungsoo Park, Ethan Mendes, Gabriel Stanovsky, Alan Ritter • 2025

Related benchmarks

Task | Dataset | Result | Rank
Large Model Performance Prediction | OpenCompass, 95% masking, September 30, 2024 cutoff (temporal split) | RMSE 23.33 | 10
Large Model Performance Prediction | Large Model Performance Prediction, 60% masking | RMSE 23.55 | 10
Large Model Performance Prediction | Large Model Performance Prediction dataset 1.0 (40% masking) | RMSE 23.6 | 10
Performance Prediction | Large Model Performance Prediction Dataset, 80% masking (test) | RMSE 23.5 | 10
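The leaderboard entries above report root mean squared error (RMSE) between predicted and actual benchmark scores. A minimal sketch of the metric, with invented toy scores on a 0-100 scale (not taken from the benchmark):

```python
# Hedged sketch: RMSE between predicted and reported benchmark scores.
# The score values below are invented for illustration only.
import math

def rmse(predicted, actual):
    """Root mean squared error over paired score lists."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be equal-length, non-empty")
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
    return math.sqrt(mse)

# Toy example: three predicted vs. reported scores.
score_error = rmse([70.0, 55.0, 40.0], [65.0, 60.0, 52.0])
```

Because errors are squared before averaging, RMSE penalizes large misses more heavily than MAE, so the two metrics reported for this task are not directly comparable.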
