Cautious Weight Decay
About
We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
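The masking idea can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes the optimizer step has the form θ ← θ − η(u + λ·m⊙θ), where u is the optimizer update (e.g. the Adam or Lion direction, excluding decay) and the mask m keeps the decay term only on coordinates where the update and the parameter share a sign, so the decay never opposes the descent direction.

```python
import numpy as np

def cwd_step(param, update, lr, weight_decay):
    """One optimizer step with Cautious Weight Decay (illustrative sketch).

    `update` is the optimizer's raw update direction (to be subtracted),
    excluding weight decay. Standard decoupled decay would subtract
    lr * weight_decay * param on every coordinate; CWD masks the decay to
    coordinates where sign(update) == sign(param), i.e. where pulling the
    parameter toward zero agrees with the optimizer's own direction.
    """
    # 1 where the update and the parameter share a sign, 0 elsewhere.
    mask = (np.sign(update) == np.sign(param)).astype(param.dtype)
    # Decoupled update: masked decay is added to the step, not the gradient.
    return param - lr * (update + weight_decay * mask * param)

# Example: decay fires only on the first coordinate, where signs align.
theta = np.array([1.0, -1.0, 2.0])
u = np.array([0.5, 0.5, -0.5])
print(cwd_step(theta, u, lr=0.1, weight_decay=0.1))  # [ 0.94 -1.05  2.05]
```

Because the change is a single elementwise mask, it drops into AdamW, Lion, or Muon without new hyperparameters; only the decay term is gated, the update itself is untouched.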
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | C4 (val) | -- | 392 |
| Commonsense Reasoning | PIQA (test) | Accuracy: 71 | 46 |
| Commonsense Reasoning | CommonsenseQA (test) | Accuracy: 33 | 41 |
| Commonsense Reasoning | HellaSwag (test) | Accuracy: 41 | 21 |
| Reasoning | ARC Easy (test) | Normalized Accuracy: 53 | 4 |
| Reasoning | ARC Challenge (test) | Normalized Accuracy: 28 | 4 |
| Reasoning | MMLU (test) | Accuracy: 26 | 4 |