Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World
About
The scaling laws guiding modern model training were calibrated for a single regime: data-rich, single-epoch pretraining. The dominant form, Chinchilla's $L = E + A/N^\alpha + B/D^\beta$, has three structural limitations outside that regime: it diverges as unique data shrinks instead of saturating at the uninformed baseline; it cannot represent overfitting when capacity exceeds the data; and it conflates total examples seen with unique examples available. We propose a closed-form extension, $L(N, D, T) = E + (L_0 - E)\,h/(1+h)$ with $h = a/N^\alpha + b/T^\beta + c\,N^\gamma/D^\delta$, whose three terms capture undercapacity, undertraining, and overfitting, respectively. The loss saturates between the irreducible loss $E$ and an uninformed baseline $L_0$ fixed by the loss type, and the form reduces to Chinchilla in the data-rich, single-epoch limit. We validate it on four multi-epoch experiments spanning four architecture families (MLPs, ResNets, Fourier neural operators, and transformers) across vision, scientific ML, and language domains, and refit it to five published LLM scaling-law grids. When extrapolating to higher compute and larger unique data than seen at fit time, our form achieves state-of-the-art RMSE on every published LLM grid we evaluate and on most cells of our constructed experiments. Once calibrated, the form admits a cost-aware allocation that recovers Chinchilla's optimum when data is free and shifts toward smaller corpora and more epochs as data grows expensive.
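To make the behavior of the proposed form concrete, here is a minimal Python sketch that evaluates it alongside the Chinchilla form. The function names and every coefficient value are hypothetical placeholders chosen for illustration, not fitted values from the paper. The matching $A = (L_0 - E)\,a$ and $B = (L_0 - E)\,b$ follows from expanding $h/(1+h) \approx h$ for small $h$ with $T = D$; it is our reading of how the limit works, not a statement of the authors' derivation.

```python
def saturating_loss(N, D, T, *, E, L0, a, alpha, b, beta, c, gamma, delta):
    """L(N, D, T) = E + (L0 - E) * h / (1 + h), as given in the abstract.

    N: parameter count, D: unique examples, T: total examples seen.
    """
    h = (
        a / N**alpha               # undercapacity term
        + b / T**beta              # undertraining term
        + c * N**gamma / D**delta  # overfitting term
    )
    return E + (L0 - E) * h / (1.0 + h)


def chinchilla_loss(N, D, *, E, A, alpha, B, beta):
    """Chinchilla form L = E + A / N^alpha + B / D^beta."""
    return E + A / N**alpha + B / D**beta


# Hypothetical coefficients for illustration only (NOT fitted values).
PARAMS = dict(E=1.7, L0=10.0, a=50.0, alpha=0.35, b=200.0, beta=0.3,
              c=1.0, gamma=0.5, delta=1.0)

# Chinchilla coefficients implied by the small-h limit (assumption: T = D).
CHINCHILLA = dict(
    E=PARAMS["E"],
    A=(PARAMS["L0"] - PARAMS["E"]) * PARAMS["a"],
    alpha=PARAMS["alpha"],
    B=(PARAMS["L0"] - PARAMS["E"]) * PARAMS["b"],
    beta=PARAMS["beta"],
)

if __name__ == "__main__":
    # 1) Tiny unique data: the proposed form stays below L0, Chinchilla does not.
    print("tiny data  :", saturating_loss(1e6, 1.0, 1.0, **PARAMS),
          chinchilla_loss(1e6, 1.0, **CHINCHILLA))

    # 2) Fixed small D: growing N eventually raises the loss (overfitting).
    print("overfitting:", saturating_loss(1e5, 1e4, 1e6, **PARAMS),
          saturating_loss(1e9, 1e4, 1e6, **PARAMS))

    # 3) Data-rich, single epoch (T = D): the two forms nearly coincide.
    print("data-rich  :", saturating_loss(1e12, 1e15, 1e15, **PARAMS),
          chinchilla_loss(1e12, 1e15, **CHINCHILLA))
```

With these placeholder coefficients, the three printed pairs illustrate the three limitations listed above: saturation at $L_0$, loss increasing with $N$ at fixed small $D$, and agreement with Chinchilla when data is abundant and seen once.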
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Scaling-law extrapolation | MNIST high-C holdout | RMSE (log space) | 0.127 | 6 |
| Scaling-law extrapolation | CIFAR-100 high-C holdout | RMSE (log space) | 0.081 | 6 |
| Scaling-law extrapolation | Darcy high-C (holdout) | RMSE (log space) | 0.168 | 6 |
| Scaling-law extrapolation | Chinchilla grid (high-C holdout) | RMSE (log space) | 0.007 | 6 |
| Scaling-law extrapolation | Muennighoff grid high-C holdout | RMSE (log space) | 0.059 | 6 |
| Scaling-law extrapolation | Gadre grid high-C holdout | RMSE (log space) | 0.014 | 6 |
| Scaling-law extrapolation | Porian grid high-C (holdout) | RMSE (log space) | 0.063 | 6 |
| Scaling-law extrapolation | Farseer grid (high-C holdout) | RMSE (log space) | 0.008 | 6 |
| Scaling-law extrapolation | CIFAR-100 high-D holdout | RMSE (log space) | 0.069 | 6 |
| Scaling-law extrapolation | Darcy high-D (holdout) | RMSE (log space) | 0.17 | 6 |
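The table reports RMSE in log space on held-out points. The page does not spell out the exact definition, so the following is only a sketch under one common interpretation: the root-mean-square difference between the natural logarithms of predicted and observed losses. The function name `rmse_log` is hypothetical, and the paper may use a different log base or transform.

```python
import math


def rmse_log(predicted, observed):
    """Root-mean-square error between natural-log losses.

    Assumption: 'log space' means comparing ln(predicted) with ln(observed);
    both sequences must be positive and of equal, nonzero length.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(math.log(p) - math.log(o)) ** 2
               for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))
```

Under this reading, a small value means the fitted law's predictions are off by roughly that relative fraction on the holdout points, since a difference of logarithms approximates a relative error when it is small.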