
Block Pruning For Faster Transformers

About

Pre-training has improved model accuracy for both classification and generation tasks at the cost of introducing much larger and slower models. Pruning methods have proven to be an effective way of reducing model size, whereas distillation methods are proven for speeding up inference. We introduce a block pruning approach targeting both small and fast models. Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement pruning paradigm for fine-tuning. We find that this approach learns to prune out full components of the underlying model, such as attention heads. Experiments consider classification and generation tasks, yielding among other results a pruned model that is a 2.4x faster, 74% smaller BERT on SQuAD v1, with a 1% drop on F1, competitive both with distilled models in speed and pruned models in size.
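To illustrate the core idea, here is a minimal NumPy sketch of block-level pruning: a weight matrix is partitioned into fixed-size blocks, each block carries an importance score, and the lowest-scoring blocks are zeroed out. In the paper the scores are learned during fine-tuning via the movement pruning objective; here they are simply given as input, and the function name and signature are illustrative, not from the paper's code.

```python
import numpy as np

def block_prune(weight, scores, block_shape, keep_ratio):
    """Zero out the lowest-scoring blocks of `weight`.

    weight:      2-D array, shape divisible by block_shape
    scores:      one importance score per block, shape (H/bh, W/bw)
    block_shape: (bh, bw), e.g. a full attention head's slice
    keep_ratio:  fraction of blocks to keep
    """
    bh, bw = block_shape
    flat = scores.reshape(-1)
    k = max(1, int(round(keep_ratio * flat.size)))
    # keep the k highest-scoring blocks
    thresh = np.sort(flat)[-k]
    block_mask = (scores >= thresh).astype(weight.dtype)
    # expand the per-block mask to a per-element mask
    full_mask = np.kron(block_mask, np.ones((bh, bw), dtype=weight.dtype))
    return weight * full_mask

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
scores = rng.standard_normal((2, 2))   # one score per 4x4 block
pruned = block_prune(W, scores, (4, 4), keep_ratio=0.5)
```

With block shapes matching structural units of the transformer (attention head slices, feed-forward rows), zeroed blocks can be physically removed after training, which is what yields the reported speedups.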

François Lagunas, Ella Charlaix, Victor Sanh, Alexander M. Rush • 2021

Related benchmarks

Task                            Dataset          Metric     Result   Rank
Natural Language Understanding  GLUE             SST-2      92.7     452
Sentiment Analysis              SST-2 (test)     Accuracy   93.23    136
Summarization                   CNN / Daily Mail ROUGE-1    41.4     67
Sentiment Analysis              SST-2 (GLUE)     F1 Score   93.23    45
Natural Language Inference      MNLI (test)      Accuracy   0.837    38
Pruning                         MNLI             Epochs     20       5
