
A Simple Linear Patch Revives Layer-Pruned Large Language Models

About

Layer pruning has emerged as a widely used technique for compressing large language models (LLMs). However, existing layer pruning approaches often incur substantial performance degradation. We attribute most of this degradation to a single yet previously overlooked issue: the mismatch of activation magnitudes at the pruning interface. The pre-interface activations exhibit significantly different scales from the post-interface ones, causing a distributional shift that propagates through the remaining layers. To address this issue, we introduce LinearPatch, a lightweight, plug-and-play technique that fuses two operations into a single matrix multiply at the pruning interface: (i) a Hadamard transformation that suppresses massive outliers at particular tokens, and (ii) a channel-wise scaling that aligns activation statistics. On LLaMA-3-8B, LinearPatch preserves up to 94.15% of the original model's performance when pruning 5 of 32 layers, outperforming the previous state of the art by 4%. The patch can be further refined with 5K unlabeled samples via memory-efficient offline distillation, pushing retention to 95.16% within only 30 minutes on a single GPU. Code is available at https://github.com/chenxinrui-tsinghua/LinearPatch.
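The fusion described above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation: it assumes the channel-wise scaling is applied in the Hadamard-rotated space and rotated back, so the whole patch collapses into one matrix P = Hᵀ·diag(s)·H that is applied with a single matmul at the pruning interface. The function names and the toy scale vector are hypothetical.

```python
import numpy as np

def hadamard_matrix(n):
    # Orthonormal Hadamard matrix via Sylvester construction.
    # n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def build_linear_patch(scale, n):
    # Hypothetical fusion: Hadamard rotation, per-channel scaling
    # in the rotated basis, and inverse rotation collapse into a
    # single matrix, so the patch costs one matrix multiply.
    H = hadamard_matrix(n)
    return H.T @ np.diag(scale) @ H

# Toy usage: 8 channels, hypothetical per-channel scale factors
# that would align pre- and post-interface activation statistics.
n = 8
scale = np.linspace(0.5, 1.5, n)
P = build_linear_patch(scale, n)

x = np.random.randn(4, n)   # pre-interface activations (batch of 4)
x_patched = x @ P           # one matmul; P is symmetric here
```

Because the Hadamard matrix is orthonormal, applying P is exactly equivalent to rotating, scaling each rotated channel, and rotating back, but at the cost of a single matrix multiply.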

Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, Chun Yuan • 2025

Related benchmarks

Task | Dataset | Result | Rank
Question Answering | ARC Challenge | Accuracy 43.17 | 749
Question Answering | ARC Easy | Accuracy 64.35 | 386
Question Answering | WinoGrande (WG) | Accuracy 70.17 | 98
Question Answering | PIQA | Accuracy 73.23 | 83
Multiple-choice Question Answering | HellaSwag | Accuracy 69.33 | 59
Question Answering | WinoGrande, HellaSwag, ARC-e, ARC-c, PIQA Average | Avg Accuracy 62.75 | 35
