Cautious Weight Decay
About
We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
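The masking idea can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes the optimizer step has the form θ ← θ − η(u + λ·m⊙θ), where u is the optimizer update (e.g. the Adam or Lion direction, excluding decay) and the mask m keeps the decay term only on coordinates where the update and the parameter share a sign, so the decay never opposes the descent direction.

```python
import numpy as np

def cwd_step(param, update, lr, weight_decay):
    """One optimizer step with Cautious Weight Decay (illustrative sketch).

    `update` is the optimizer's raw update direction (to be subtracted),
    excluding weight decay. Standard decoupled decay would subtract
    lr * weight_decay * param on every coordinate; CWD masks the decay to
    coordinates where sign(update) == sign(param), i.e. where pulling the
    parameter toward zero agrees with the optimizer's own direction.
    """
    # 1 where the update and the parameter share a sign, 0 elsewhere.
    mask = (np.sign(update) == np.sign(param)).astype(param.dtype)
    # Decoupled update: masked decay is added to the step, not the gradient.
    return param - lr * (update + weight_decay * mask * param)

# Example: decay fires only on the first coordinate, where signs align.
theta = np.array([1.0, -1.0, 2.0])
u = np.array([0.5, 0.5, -0.5])
print(cwd_step(theta, u, lr=0.1, weight_decay=0.1))  # [ 0.94 -1.05  2.05]
```

Because the change is a single elementwise mask, it drops into AdamW, Lion, or Muon without new hyperparameters; only the decay term is gated, the update itself is untouched.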
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | C4 (val) | -- | 392 |
| Commonsense Reasoning | PIQA (test) | Accuracy: 71 | 46 |
| Commonsense Reasoning | CommonsenseQA (test) | Accuracy: 33 | 41 |
| Commonsense Reasoning | HellaSwag (test) | Accuracy: 41 | 21 |
| Reasoning | ARC Easy (test) | Normalized Accuracy: 53 | 4 |
| Reasoning | ARC Challenge (test) | Normalized Accuracy: 28 | 4 |
| Reasoning | MMLU (test) | Accuracy: 26 | 4 |