Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

About

The escalating scale and cost of training Large Language Models (LLMs) make it necessary to predict downstream task performance accurately before pre-training, so that scaling properties can be understood in full. This prediction is challenged by: 1) the emergence phenomenon, in which unpredictable capabilities appear suddenly at critical model scales; and 2) uneven task difficulty and inconsistent performance scaling patterns, which lead to high metric variability. Current prediction methods lack both accuracy and reliability. We propose a Clustering-On-Difficulty (COD) framework for downstream performance prediction. COD clusters tasks by their difficulty scaling features, thereby constructing a more stable and predictable task subset that exhibits well-behaved scaling characteristics as the compute budget increases. We adopt a theoretically supported performance scaling law to predict cluster-wise performance. The performance of this predictable subset then serves as an intermediate predictor for the full evaluation set, and we derive a mapping function to extrapolate subset performance to the full set accurately. Applied to an LLM with 70B parameters, COD achieved a 1.55% average prediction error across eight key LLM benchmarks, providing actionable insight into scaling properties and supporting training monitoring during LLM pre-training.
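Concretely, the pipeline can be pictured as: cluster tasks by the shape of their accuracy-vs-compute curves, fit a scaling law to the clusters whose curves it explains well, and map the resulting predictable-subset prediction back to the full evaluation set. The sketch below is a hypothetical, minimal rendering of that idea; the KMeans clustering, the sigmoid-shaped scaling law, the R² threshold, the linear subset-to-full-set mapping, and the names cod_predict and scaling_law are stand-in assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.cluster import KMeans


def scaling_law(log_c, a, b, s):
    """Saturating accuracy-vs-log-compute curve (illustrative functional form)."""
    return s / (1.0 + np.exp(-(a * log_c + b)))


def cod_predict(acc, log_c, target_log_c, n_clusters=4, min_r2=0.9):
    """acc: (n_tasks, n_checkpoints) per-task accuracy at observed checkpoints;
    log_c: log-compute of those checkpoints. Returns an extrapolated
    full-set accuracy at target_log_c."""
    acc = np.asarray(acc, dtype=float)

    # 1) Cluster tasks by the shape of their difficulty/accuracy curves.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(acc)

    subset_mask = np.zeros(acc.shape[0], dtype=bool)
    preds, weights = [], []
    for k in range(n_clusters):
        members = labels == k
        curve = acc[members].mean(axis=0)
        # 2) Fit the scaling law to the cluster-average curve.
        try:
            params, _ = curve_fit(scaling_law, log_c, curve,
                                  p0=[1.0, 0.0, 1.0], maxfev=10_000)
        except RuntimeError:
            continue  # drop clusters the law cannot fit at all
        fit = scaling_law(log_c, *params)
        r2 = 1.0 - ((curve - fit) ** 2).sum() / (((curve - curve.mean()) ** 2).sum() + 1e-12)
        if r2 < min_r2:
            continue  # keep only well-behaved ("predictable") clusters
        subset_mask |= members
        preds.append(scaling_law(target_log_c, *params))
        weights.append(int(members.sum()))

    if not preds:
        raise ValueError("no cluster scaled predictably enough to extrapolate")

    # 3) Predicted accuracy of the predictable subset at the target compute.
    subset_pred = float(np.average(preds, weights=weights))

    # 4) Map subset accuracy to full-set accuracy via a linear fit over the
    #    observed checkpoints (a stand-in for the paper's derived mapping).
    subset_hist = acc[subset_mask].mean(axis=0)
    full_hist = acc.mean(axis=0)
    slope, intercept = np.polyfit(subset_hist, full_hist, deg=1)
    return slope * subset_pred + intercept
```

Given per-task accuracies recorded at several pre-training checkpoints, cod_predict returns an estimated full-set accuracy at a larger target compute; clusters whose curves the law cannot explain (e.g. emergent or noisy tasks) are excluded from the intermediate predictor rather than forecast directly.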

Chengyin Xu, Kaiyuan Chen, Xiao Li, Ke Shen, Chenggang Li • 2025

Related benchmarks

Task:    Performance Prediction
Dataset: Performance Prediction Evaluation Suite, 70B Model on GSM8k, MATH, BBH, TriviaQA, MBPP, AGIEval, DROP, MMLU-pro (evaluation sets)
Result:  Mean Absolute Prediction Error (%): 1.55
Rank:    6
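For reference, the headline number reads as the mean of per-benchmark absolute differences between predicted and measured scores, in percentage points. A minimal computation sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def mean_absolute_prediction_error(predicted, actual):
    """Mean absolute error between predicted and measured per-benchmark
    scores, both given in percent (e.g. across the eight benchmarks above)."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(predicted - actual)))
```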
