A Simple Linear Patch Revives Layer-Pruned Large Language Models
About
Layer pruning has emerged as a widely used technique for compressing large language models (LLMs). However, existing layer pruning approaches often incur substantial performance degradation. We trace the majority of this degradation to a single, previously overlooked issue: *the mismatch of activation magnitudes at the pruning interface*. The pre-interface activations exhibit significantly different scales from the post-interface ones, causing a distributional shift that propagates through the remaining layers. To address this issue, we introduce **LinearPatch**, a lightweight and plug-and-play technique that fuses two operations into one matrix multiply at the pruning interface: (i) a Hadamard transformation that suppresses massive outliers at particular tokens and (ii) a channel-wise scaling that aligns activation statistics. On LLaMA-3-8B, LinearPatch preserves up to **94.15%** of the original model's performance when pruning 5 out of 32 layers, outperforming the previous state of the art by **4%**. The patch can be further refined with 5K unlabeled samples via memory-efficient offline distillation, pushing the retention to 95.16% within only 30 minutes on a single GPU. Code is available at https://github.com/chenxinrui-tsinghua/LinearPatch.
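The sketch below illustrates the general idea of such a fused patch, assuming calibration activations collected just before and just after the pruned block. The function name `build_linear_patch`, the RMS-based scaling rule, and the Sylvester construction of the Hadamard matrix are illustrative assumptions, not the authors' released implementation.

```python
import torch

def build_linear_patch(pre_acts: torch.Tensor, post_acts: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of a LinearPatch-style matrix (not the official code).

    pre_acts, post_acts: (num_tokens, hidden_dim) calibration activations taken
    just before and just after the pruned block. Returns a (hidden_dim, hidden_dim)
    matrix P so that the whole patch is one matrix multiply: x_patched = x_pre @ P.
    """
    d = pre_acts.shape[-1]
    assert d & (d - 1) == 0, "this sketch assumes hidden_dim is a power of two"

    # Normalized Hadamard matrix (orthogonal), built by Sylvester recursion.
    H = torch.ones(1, 1, dtype=pre_acts.dtype)
    while H.shape[0] < d:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    H = H / (d ** 0.5)

    # Rotate to the Hadamard basis, which spreads per-token outliers across channels.
    pre_rot = pre_acts @ H
    post_rot = post_acts @ H

    # Channel-wise scale (assumed here to match RMS magnitudes) that aligns
    # pre-interface statistics with post-interface ones.
    scale = post_rot.pow(2).mean(0).sqrt() / pre_rot.pow(2).mean(0).sqrt().clamp_min(1e-6)

    # Fuse rotate -> scale -> rotate back into a single matrix.
    return H @ torch.diag(scale) @ H.T
```

At inference time the patch would simply be applied to the hidden states crossing the pruning interface, e.g. `hidden = hidden @ P`, adding a single matrix multiply and no extra parameters elsewhere in the model.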
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Question Answering | ARC Challenge | Accuracy | 43.17 | 749 |
| Question Answering | ARC Easy | Accuracy | 64.35 | 386 |
| Question Answering | WinoGrande (WG) | Accuracy | 70.17 | 98 |
| Question Answering | PIQA | Accuracy | 73.23 | 83 |
| Multiple-choice Question Answering | HellaSwag | Accuracy | 69.33 | 59 |
| Question Answering | WinoGrande, HellaSwag, ARC-e, ARC-c, PIQA Average | Avg Accuracy | 62.75 | 35 |