
Cautious Weight Decay

About

We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
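The sign-masked decay described above can be sketched in a few lines. Below is a minimal NumPy illustration of one parameter update, not the authors' implementation: `cwd_step`, `update`, `lr`, and `wd` are illustrative names, and `update` stands in for the base optimizer's step direction (e.g. an AdamW, Lion, or Muon step before decoupled weight decay) under the assumption that the mask keeps decay only where the update's sign matches the parameter's sign.

```python
import numpy as np

def cwd_step(theta, update, lr=1e-3, wd=0.1):
    """One parameter update with Cautious Weight Decay (illustrative sketch).

    Standard decoupled decay would subtract lr * wd * theta on every
    coordinate.  CWD instead masks the decay term, applying it only on
    coordinates where the sign of the optimizer update agrees with the
    sign of the parameter -- i.e. where the loss-driven step is already
    moving that weight toward zero.
    """
    # 1.0 where sign(update_i) == sign(theta_i), else 0.0
    mask = (np.sign(update) == np.sign(theta)).astype(theta.dtype)
    # base optimizer step plus sign-masked decoupled weight decay
    return theta - lr * update - lr * wd * mask * theta

# Example: decay hits the first coordinate (signs agree) but not the second.
theta = np.array([1.0, -1.0])
update = np.array([0.5, 0.5])
print(cwd_step(theta, update, lr=0.1, wd=0.5))  # -> [ 0.9  -1.05]
```

Because the mask only gates the decay term, the change is a one-line edit to an existing optimizer loop and introduces no new hyperparameters.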

Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | C4 (val) | – | – | 392 |
| Commonsense Reasoning | PIQA (test) | Accuracy | 71 | 46 |
| Commonsense Reasoning | CommonsenseQA (test) | Accuracy | 33 | 41 |
| Commonsense Reasoning | HellaSwag (test) | Accuracy | 41 | 21 |
| Reasoning | ARC Easy (test) | Normalized Accuracy | 53 | 4 |
| Reasoning | ARC Challenge (test) | Normalized Accuracy | 28 | 4 |
| Reasoning | MMLU (test) | Accuracy | 26 | 4 |

Other info

GitHub
