Pay Attention to Small Weights
About
Finetuning large pretrained neural networks is known to be resource-intensive, both in terms of memory and computational cost. To mitigate this, a common approach is to restrict training to a subset of the model parameters. By analyzing the relationship between gradients and weights during finetuning, we observe a notable pattern: large gradients are often associated with small-magnitude weights. This correlation is more pronounced in finetuning settings than in training from scratch. Motivated by this observation, we propose NANOADAM, which dynamically updates only the small-magnitude weights during finetuning and offers several practical advantages. First, the selection criterion is gradient-free: the parameter subset can be determined without gradient computation. Second, it preserves large-magnitude weights, which are likely to encode critical features learned during pretraining, thereby reducing the risk of catastrophic forgetting. Third, it permits the use of larger learning rates and consistently leads to better generalization performance in experiments. We demonstrate this for both NLP and vision tasks.
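The selection rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `fraction` parameter, and the toy weights and gradients are all assumptions made for illustration, and the update is a plain Adam step restricted to the selected entries. The mask is computed from weight magnitudes alone, which is what makes the criterion gradient-free.

```python
import math

def small_weight_mask(weights, fraction):
    """Select the `fraction` of weights with the smallest magnitude.

    Only |w| is inspected, so no gradient is needed to pick the subset.
    `fraction` is a hypothetical hyperparameter for this sketch.
    """
    k = max(1, int(len(weights) * fraction))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    chosen = set(order[:k])
    return [i in chosen for i in range(len(weights))]

def masked_adam_step(weights, grads, mask, m, v, t,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one standard Adam step only where mask is True.

    Unselected (large-magnitude) weights are left untouched.
    m and v are updated in place for the selected entries.
    """
    new_weights = list(weights)
    for i, selected in enumerate(mask):
        if not selected:
            continue
        m[i] = beta1 * m[i] + (1 - beta1) * grads[i]
        v[i] = beta2 * v[i] + (1 - beta2) * grads[i] ** 2
        m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v[i] / (1 - beta2 ** t)  # bias-corrected second moment
        new_weights[i] = weights[i] - lr * m_hat / (math.sqrt(v_hat) + eps)
    return new_weights

# Toy example (made-up numbers): update the smallest half of six weights.
weights = [0.9, -0.01, 0.5, 0.02, -1.2, 0.003]
grads = [0.1, 0.4, -0.2, -0.3, 0.05, 0.6]
mask = small_weight_mask(weights, fraction=0.5)
m = [0.0] * len(weights)
v = [0.0] * len(weights)
updated = masked_adam_step(weights, grads, mask, m, v, t=1, lr=0.01)
```

For simplicity the sketch keeps full-size moment buffers `m` and `v`; a memory-conscious variant would presumably store optimizer state only for the selected entries and recompute the mask periodically as weight magnitudes change.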
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code-Specific Instruction Tuning Evaluation | Magicoder Evaluation Suite | ARC-C Accuracy | 52.87 | 48 |
| Forgetting-aware Instruction Tuning | Magicoder Stability and Plasticity suites (test) | ARC-C | 52.87 | 36 |
| Instruction Fine-tuning | MetaMathQA Fine-tuning Evaluation Suite (ARC-C, PIQA, MMLU, HE, GSM8K) (test) | ARC-C Accuracy | 50.67 | 32 |
| Instruction Tuning | Magicoder HumanEval | Stability | 48.61 | 7 |