LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
About
Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong $R^2$ scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on $\leq$6.9B Pythia models with $\leq$180B tokens, it predicts the unseen 12B model up to 307B tokens at pooled $R^2{=}0.847$, while monotonic baselines collapse.
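For reference, the Shannon-Hartley theorem that the abstract builds on gives the capacity of a band-limited channel with additive white Gaussian noise as $C = B \log_2(1 + S/N)$. The sketch below evaluates only that textbook formula. It is not the paper's fitted law, whose exact functional form is not stated in this abstract. The function name `shannon_capacity` and all numeric values are illustrative assumptions.

```python
import math


def shannon_capacity(bandwidth: float, signal_power: float, noise_power: float) -> float:
    """Textbook Shannon-Hartley capacity: C = B * log2(1 + S / N)."""
    if bandwidth <= 0 or signal_power < 0 or noise_power <= 0:
        raise ValueError("bandwidth and noise_power must be > 0, signal_power >= 0")
    return bandwidth * math.log2(1.0 + signal_power / noise_power)


# Case 1: the SNR is held fixed at 15, so capacity grows linearly with bandwidth.
fixed_snr = [shannon_capacity(b, 15.0, 1.0) for b in (1.0, 2.0, 4.0)]

# Case 2: noise power grows with bandwidth (N = n0 * B, white noise).
# Capacity then saturates at S / (n0 * ln 2) however wide the channel gets.
n0 = 1.0        # hypothetical noise power per unit bandwidth
signal = 15.0   # hypothetical signal power
growing_noise = [
    shannon_capacity(b, signal, n0 * b) for b in (1.0, 10.0, 100.0, 1000.0)
]
limit = signal / (n0 * math.log(2))

print(fixed_snr)
print(growing_noise, limit)
```

Under the abstract's analogy, model parameters play the role of $B$ and training tokens the role of $S$. The second case shows the textbook reason why widening the channel alone stops helping once the SNR is not preserved. The U-shaped degradation described above goes beyond this saturation and depends on the paper's own noise model.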
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Model Extrapolation | Pythia k=3 (1B, 410M, 160M) | Pooled $R^2$ | 0.605 | 8 |
| Model Extrapolation | Pythia k=4 (≤2.8B) | Pooled $R^2$ | 0.837 | 8 |
| Model Extrapolation | Pythia ≤6.9B (k=5) | Pooled $R^2$ | 0.847 | 8 |
| Scaling Law Modeling | Pythia AWQ 4-bit | $R^2$ score | 0.9935 | 8 |
| Scaling Law Modeling | Pythia bnb 4-bit | $R^2$ score | 99.36 | 8 |
| Scaling Law Modeling | Pythia quanto 2-bit | $R^2$ score | 0.9031 | 8 |
| Token Extrapolation | Pythia Predict 75.5B–307B | Pooled $R^2$ | 80.5 | 8 |
| Token Extrapolation | Pythia Predict 180.4B–307B | Pooled $R^2$ | 0.781 | 8 |
| Token Extrapolation | Pythia Predict 272.6B–307B | Pooled $R^2$ | 0.945 | 8 |
| Scaling Law Fitting | Pythia Suite | Performance (4-bit) | 99.53 | 7 |
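The $R^2$ values in the table are coefficients of determination, $R^2 = 1 - SS_{\mathrm{res}} / SS_{\mathrm{tot}}$. The sketch below shows one plausible reading of "pooled": all (observed, predicted) loss pairs from every held-out curve are concatenated before a single $R^2$ is computed. That reading, the helper names `r_squared` and `pooled_r_squared`, and the toy numbers are assumptions made for illustration, not the paper's evaluation code.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


def pooled_r_squared(curves: Sequence[tuple[Sequence[float], Sequence[float]]]) -> float:
    """Concatenate every (observed, predicted) curve, then compute one R^2."""
    all_true = [t for y_true, _ in curves for t in y_true]
    all_pred = [p for _, y_pred in curves for p in y_pred]
    return r_squared(all_true, all_pred)


# Toy data (hypothetical): one curve predicted exactly, one predicted as a flat line.
curves = [
    ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
    ([4.0, 5.0, 6.0], [5.0, 5.0, 5.0]),
]
print(pooled_r_squared(curves))

# A prediction that trends the wrong way scores below zero.
print(r_squared([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Because $R^2$ is unbounded below, a law that predicts a monotonic decrease where the observed loss rises can score negative, which is one way a baseline can "collapse" on a U-shaped curve.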