
S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training

About

Training deep neural networks (DNNs) is costly. Fortunately, Nvidia Ampere and Hopper GPUs can accelerate matrix multiplications up to twice as fast as their dense equivalents by exploiting 2:4 sparsity. However, previous STE-based 2:4 pre-training methods (e.g., STE with hard-thresholding, SR-STE) suffer from optimization difficulties because of their discontinuous pruning functions. In this study, we comprehensively analyse the bottlenecks of traditional N:M sparse training and identify three drawbacks of discontinuity: an incorrect descent direction, an inability to predict the amount of descent, and sparse mask oscillation. In light of this, we propose S-STE, a simple yet powerful 2:4 training method with two parts: continuously projecting weights to be 2:4 sparse, and rescaling the sparse weights with a per-tensor fixed scaling factor. In addition, we adopt minimum-variance unbiased estimation for the activation gradient and FP8 quantization for the whole process. Results show that our method surpasses previous 2:4 pre-training recipes and is comparable even with full-parameter models. Our toolkit is available at https://github.com/huyz2023/2by4-pretrain.
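To make the contrast concrete, the sketch below shows hard-thresholding 2:4 pruning (keep the 2 largest-magnitude weights in every group of 4, as in prior STE-based methods) next to a continuous soft-thresholded variant and a least-squares per-tensor rescale. The soft projection and the scaling factor here are illustrative stand-ins consistent with the abstract's description, not necessarily the paper's exact formulas; see the linked toolkit for the authors' implementation.

```python
import numpy as np

def prune_24_hard(w):
    """Hard-thresholding 2:4 pruning: in every group of 4 weights, zero the
    2 smallest-magnitude entries. Discontinuous: a tiny change in w can flip
    the mask and jump the output."""
    groups = np.asarray(w, dtype=float).reshape(-1, 4)
    keep = np.argsort(np.abs(groups), axis=1)[:, 2:]   # 2 largest per group
    out = np.zeros_like(groups)
    rows = np.arange(groups.shape[0])[:, None]
    out[rows, keep] = groups[rows, keep]
    return out.reshape(np.shape(w))

def prune_24_soft(w):
    """Continuous 2:4 projection sketch (soft-thresholding): shrink the kept
    weights toward zero by the group's 3rd-largest magnitude, so the output
    varies continuously as a weight crosses the pruning threshold.
    Illustrative only; the exact S-STE pruning function may differ."""
    groups = np.asarray(w, dtype=float).reshape(-1, 4)
    mags = np.abs(groups)
    order = np.argsort(mags, axis=1)
    keep = order[:, 2:]
    # threshold = largest *pruned* magnitude in each group
    thresh = mags[np.arange(groups.shape[0]), order[:, 1]][:, None]
    out = np.zeros_like(groups)
    rows = np.arange(groups.shape[0])[:, None]
    out[rows, keep] = np.sign(groups[rows, keep]) * (mags[rows, keep] - thresh)
    return out.reshape(np.shape(w))

def rescale_per_tensor(sparse, dense):
    """Per-tensor fixed scaling factor sketch: one plausible choice is the
    least-squares beta minimizing ||beta * sparse - dense||^2."""
    s, d = sparse.ravel(), dense.ravel()
    beta = np.dot(d, s) / (np.dot(s, s) + 1e-12)
    return beta * sparse

w = np.array([0.9, -0.5, 0.1, 0.05, 1.0, 0.2, -0.21, 0.0])
print(prune_24_hard(w))                      # 2 nonzeros per group of 4
print(rescale_per_tensor(prune_24_soft(w), w))
```

Note how the soft projection sends a weight smoothly to zero as its magnitude approaches the group threshold, whereas the hard mask changes it abruptly; this continuity is what the paper argues fixes the descent-direction and mask-oscillation problems.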

Yuezhou Hu, Jun Zhu, Jianfei Chen • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy | 78.5 | 798 |
| Machine Translation | WMT En-De 2014 (test) | BLEU | 26.11 | 379 |
| Question Answering | SQuAD | F1 | 85.5 | 127 |
| Language Modeling | GPT-2 Pre-training (val) | Validation Loss | 2.547 | 21 |
| Machine Translation | WMT En-De 2014 (val) | BLEU | 26.53 | 20 |
