
APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference

About

Fine-tuning and inference with large language models (LMs) are generally known to be expensive. Parameter-efficient fine-tuning of pretrained LMs reduces training memory by updating a small number of LM parameters but does not improve inference efficiency. Structured pruning improves LM inference efficiency by removing consistent parameter blocks, yet it often increases training memory and time. To improve both training and inference efficiency, we introduce APT, which adaptively prunes and tunes LM parameters. At the early stage of fine-tuning, APT dynamically adds salient tuning parameters for fast and accurate convergence while discarding unimportant parameters for efficiency. Compared to baselines, our experiments show that APT maintains up to 98% of task performance when pruning RoBERTa and T5 models to 40% of their parameters, and retains 86.4% of LLaMA models' performance with 70% of parameters remaining. Furthermore, APT speeds up LM fine-tuning by up to 8x and reduces large LMs' training memory footprint by up to 70%.

Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao • 2024
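To make the mechanism in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch of the general idea: during early fine-tuning, grow low-rank tuning parameters where they help, and prune unimportant structured blocks (here, whole output rows) for efficiency. This is not the authors' implementation; the class name AdaptiveLoRALinear, the row-norm salience score, and the rank/keep-ratio choices are illustrative assumptions only.

import torch
import torch.nn as nn


class AdaptiveLoRALinear(nn.Module):
    """Frozen linear layer with a low-rank adapter whose rank can grow,
    plus a structured (row-level) pruning mask. Illustrative sketch only."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # only the adapter is tuned
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.02)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        # Mask over output rows; pruned rows are zeroed out.
        self.register_buffer("row_mask", torch.ones(base.out_features))

    def effective_weight(self) -> torch.Tensor:
        # Base weight plus low-rank update, with pruned rows masked off.
        return (self.base.weight + self.lora_b @ self.lora_a) * self.row_mask[:, None]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.effective_weight(), self.base.bias)

    @torch.no_grad()
    def grow_rank(self, extra: int) -> None:
        """Add `extra` adapter dimensions; new columns of B start at zero,
        so the layer's output is unchanged at the moment of growth."""
        a_new = torch.randn(extra, self.lora_a.size(1)) * 0.02
        b_new = torch.zeros(self.lora_b.size(0), extra)
        self.lora_a = nn.Parameter(torch.cat([self.lora_a.data, a_new], dim=0))
        self.lora_b = nn.Parameter(torch.cat([self.lora_b.data, b_new], dim=1))

    @torch.no_grad()
    def prune_rows(self, keep_ratio: float) -> None:
        """Keep only the most salient output rows. Salience here is the row
        norm of the effective weight, a stand-in for a learned importance score."""
        salience = self.effective_weight().abs().sum(dim=1)
        k = max(1, int(keep_ratio * salience.numel()))
        keep = salience.topk(k).indices
        self.row_mask.zero_()
        self.row_mask[keep] = 1.0


if __name__ == "__main__":
    layer = AdaptiveLoRALinear(nn.Linear(64, 64), rank=4)
    x = torch.randn(8, 64)
    layer.grow_rank(extra=4)          # adaptive tuning: rank 4 -> 8
    layer.prune_rows(keep_ratio=0.7)  # adaptive pruning: keep 70% of rows
    print(layer(x).shape)             # torch.Size([8, 64])

In this toy setup, growth and pruning are triggered manually; an adaptive scheme would instead decide when and where to grow or prune from importance signals observed during fine-tuning.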

Related benchmarks

Task | Dataset | Result | Rank
Commonsense Reasoning | HellaSwag | Accuracy: 92.6 | 1460
Commonsense Reasoning | WinoGrande | Accuracy: 81.5 | 776
Physical Interaction Question Answering | PIQA | Accuracy: 85.9 | 323
Medical Question Answering | MedMCQA | Accuracy: 60.7 | 253
Question Answering | ARC | Accuracy: 89.1 | 154
Question Answering | PubMedQA | Accuracy: 56.1 | 145
Summarization | BillSum | Accuracy: 64.5 | 28
Financial NLP | FinGPT | Accuracy: 81.3 | 28
Efficiency Evaluation | Model Efficiency Benchmarking Llama3.1-8B | Training Time: 158 | 11
