Study of Training Dynamics for Memory-Constrained Fine-Tuning
About
Memory-efficient training of deep neural networks has become increasingly important as models grow larger while deployment environments impose strict resource constraints. We propose TraDy, a novel transfer learning scheme leveraging two key insights: layer importance for updates is architecture-dependent and can be determined a priori, while dynamic stochastic channel selection provides a superior gradient approximation compared to static approaches. We introduce a dynamic channel selection approach that stochastically resamples channels between epochs within preselected layers. Extensive experiments demonstrate that TraDy achieves state-of-the-art performance across various downstream tasks and architectures under strict memory constraints, reaching up to 99% activation sparsity, 95% weight-derivative sparsity, and a 97% reduction in FLOPs for weight-derivative computation.
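The core mechanism described above, stochastically resampling a small subset of channels to update in each epoch within preselected layers, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the keep ratio, and the uniform sampling distribution are all assumptions introduced here for clarity.

```python
import numpy as np

def sample_channel_mask(num_channels, keep_ratio, rng):
    """Stochastically select a subset of output channels to update this epoch.

    keep_ratio is illustrative; a low ratio yields high weight-derivative sparsity.
    """
    k = max(1, int(round(keep_ratio * num_channels)))
    idx = rng.choice(num_channels, size=k, replace=False)
    mask = np.zeros(num_channels, dtype=bool)
    mask[idx] = True
    return mask

def masked_weight_grad(full_grad, mask):
    """Zero the weight gradient for unselected output channels (axis 0).

    In a real training loop the masked gradients would never be computed,
    which is where the FLOP and memory savings come from; here we mask
    a dense gradient only to illustrate the resulting sparsity pattern.
    """
    sparse = np.zeros_like(full_grad)
    sparse[mask] = full_grad[mask]
    return sparse

rng = np.random.default_rng(0)
# Toy conv-layer weight gradient: (out_channels, in_channels, kH, kW).
grad = rng.standard_normal((64, 32, 3, 3))

# Dynamic selection: a fresh mask is drawn at every epoch, so over training
# all channels of the preselected layer have a chance of being updated.
for epoch in range(3):
    mask = sample_channel_mask(64, 0.05, rng)
    sparse_grad = masked_weight_grad(grad, mask)
```

With a keep ratio of 0.05, only 3 of the 64 output channels carry a nonzero weight derivative in any given epoch, i.e. roughly 95% weight-derivative sparsity, matching the order of sparsity reported in the abstract.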
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-100 (test) | -- | -- | 3518 |
| Image Classification | CIFAR-10 (test) | -- | -- | 3381 |
| Image Classification | Flowers (test) | Accuracy | 90 | 87 |
| Image Classification | Food (test) | Accuracy | 84.76 | 50 |
| Image Classification | Pets (test) | -- | -- | 36 |
| Image Classification | CUB (test) | Top-1 Accuracy | 75.89 | 31 |
| Natural Language Understanding | GLUE (test) | QNLI Score | 91.36 | 26 |
| Visual Wake Words Classification | Visual Wake Words (test) | Accuracy | 93.83 | 21 |
| Image Classification | Average (CIFAR, CUB, Flowers, Food, Pets, VWW) | Top-1 Accuracy | 88.48 | 13 |
| Image Classification | VWW (test) | Top-1 Accuracy | 88.76 | 2 |