What Matters in Transformers? Not All Attention is Needed

About

While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for real-world deployment. Despite some recognition of redundancy in LLMs, how redundancy varies across the different modules of a Transformer, such as MLP and Attention layers, is under-explored. In this work, we investigate redundancy across different modules within Transformers, including Blocks, MLP layers, and Attention layers, using a similarity-based metric. Surprisingly, despite the critical role of attention layers in distinguishing Transformers from other architectures, we found that a large portion of these layers exhibit excessively high similarity and can be pruned without degrading performance. For instance, Llama-2-70B achieved a 48.4% speedup with only a 2.4% performance drop by pruning half of the attention layers. Furthermore, by tracing model checkpoints throughout the training process, we observed that attention layer redundancy is inherent and consistent across training stages. We further propose a method that jointly drops Attention and MLP layers, allowing us to drop additional layers more aggressively. For instance, when dropping 31 layers (Attention + MLP), Llama-2-13B still retains 90% of the performance on the MMLU task. Our work provides valuable insights for future network architecture design. The code is released at: https://github.com/Shwai-He/LLM-Drop.

Shwai He, Guoheng Sun, Zheyu Shen, Ang Li • 2024
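
The abstract mentions a similarity-based metric for finding redundant layers but does not define it. The sketch below illustrates one plausible reading, assuming the metric is the mean cosine similarity between each layer's input and output hidden states (with the output taken after the residual connection), so that a layer which barely changes its input scores close to 1 and becomes a candidate for dropping. The function names `cosine_similarity`, `redundancy_scores`, and `layers_to_drop`, and the toy data, are hypothetical and are not taken from the released code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def redundancy_scores(layer_io):
    """Mean input/output cosine similarity per layer.

    layer_io: one (inputs, outputs) pair per layer, where inputs and outputs
    are lists of hidden-state vectors (one vector per token).
    Assumption: higher similarity means the layer is more redundant.
    """
    scores = []
    for inputs, outputs in layer_io:
        sims = [cosine_similarity(x, y) for x, y in zip(inputs, outputs)]
        scores.append(sum(sims) / len(sims))
    return scores


def layers_to_drop(scores, k):
    """Indices of the k layers with the highest similarity, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: three "layers", one token each, 2-dimensional hidden states.
toy_layer_io = [
    ([[1.0, 0.0]], [[1.0, 0.1]]),  # output almost equals input
    ([[1.0, 0.0]], [[0.0, 1.0]]),  # output orthogonal to input
    ([[1.0, 1.0]], [[2.0, 2.0]]),  # output is a rescaled copy of input
]
toy_scores = redundancy_scores(toy_layer_io)
toy_drop = layers_to_drop(toy_scores, 2)
```

On the toy data, the first and third layers leave the direction of the hidden state nearly unchanged, so they are the two selected for dropping, while the second layer, which rotates its input, is kept. In a real model the hidden states would be collected on a calibration set, which is outside the scope of this sketch.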

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity 14.26 | 3785 |
| Commonsense Reasoning | WinoGrande | Accuracy 71.82 | 1442 |
| Language Modeling | PTB | Perplexity 22.73 | 1234 |
| Question Answering | ARC Challenge | Accuracy (ARC) 43.6 | 598 |
| Question Answering | ARC Easy | -- | 597 |
| Multitask Language Understanding | MMLU | Accuracy 44.07 | 520 |
| Multi-task Language Understanding | MMLU | MMLU Accuracy 42.29 | 442 |
| Multiple-choice Question Answering | ARC Easy | Accuracy 60.23 | 257 |
| Reading Comprehension | BoolQ | Accuracy (BoolQ) 76.7 | 228 |
| Commonsense Reasoning | PIQA | Accuracy 73.39 | 213 |
Showing 10 of 24 rows
