Streamlining Redundant Layers to Compress Large Language Models

About

This paper introduces LLM-Streamline, a pioneer work on layer pruning for large language models (LLMs). It is based on the observation that different layers have varying impacts on hidden states, enabling the identification of less important layers to be pruned.LLM-Streamline comprises two parts: layer pruning, which removes consecutive layers with the lowest importance based on target sparsity, and layer replacement, a novel module that trains a lightweight network to replace the pruned layers to mitigate performance loss. Additionally, a new metric called stability is proposed to address the limitations of the widely used accuracy metric in evaluating model compression. Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency.Our code is available at https://github.com/RUCKBReasoning/LLM-Streamline

Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, Hong Chen• 2024

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText2	Perplexity9.76	3785
Commonsense Reasoning	HellaSwag	Accuracy61.2	1896
Commonsense Reasoning	HellaSwag	HellaSwag Accuracy45.9	711
Physical Commonsense Reasoning	PIQA	Accuracy72	696
Multitask Language Understanding	MMLU	Accuracy45.5	520
Physical Interaction Question Answering	PIQA	Accuracy71.5	415
Diagram Question Answering	AI2D	AI2D Accuracy65.4	387
Mathematical Reasoning	MathVista	Accuracy56.7	382
Chart Question Answering	ChartQA	--	371
Multi-discipline Multimodal Understanding	MMMU	--	363

Showing 10 of 64 rows

Other info

Follow for update

@wizwand_team Discord