
Distilling to Hybrid Attention Models via KL-Guided Layer Selection

About

Distilling pretrained softmax attention Transformers into more efficient hybrid architectures that interleave softmax and linear attention layers is a promising approach for improving the inference efficiency of LLMs without requiring expensive pretraining from scratch. A critical factor in the conversion process is layer selection, i.e., deciding which layers to convert to linear attention variants. This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. Once the layers have been selected, we use a recent pipeline for the distillation process itself \citep[RADLADS;][]{goldstein2025radlads}, which consists of attention weight transfer, hidden state alignment, and KL-based distribution matching, followed by a small amount of finetuning. We find that this approach is more effective than existing approaches for layer selection, including heuristics that uniformly interleave linear attention layers at a fixed ratio, as well as more involved approaches that rely on specialized diagnostic datasets.
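The core idea of KL-guided selection can be sketched in a few lines. The snippet below is a hypothetical illustration, not the paper's implementation: it assumes you can obtain, for each candidate layer, the model's output distribution after swapping that layer for a linear-attention variant, and it ranks layers by how little the output distribution shifts (small KL = cheap to convert). The function names and toy distributions are invented for the example.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rank_layers_for_conversion(teacher_probs, swapped_probs_by_layer):
    """Rank candidate layers by KL divergence between the teacher's output
    distribution and the distribution obtained when that single layer is
    replaced by a linear-attention variant. Layers whose replacement
    perturbs the output least come first (best conversion targets)."""
    scores = {layer: kl_divergence(teacher_probs, q)
              for layer, q in swapped_probs_by_layer.items()}
    return sorted(scores, key=scores.get)

# Toy example: swapping layer 0 barely moves the distribution,
# swapping layer 1 moves it a lot, so layer 0 is preferred.
teacher = [0.7, 0.2, 0.1]
swapped = {0: [0.69, 0.21, 0.10],
           1: [0.50, 0.30, 0.20]}
print(rank_layers_for_conversion(teacher, swapped))  # [0, 1]
```

In practice the scores would be averaged over a batch of generic text rather than computed from a single distribution, which is consistent with the abstract's "small amount of training on generic text data".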

Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda, Jiawei Zhou, Yoon Kim • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Question Answering | SQuAD 2.0 | F1 | 33.352 | 190
Long-context language modeling | RULER | RULER Score | 0.911 | 148
Long-context language modeling evaluation | FDA (test) | Score | 0.8004 | 120
Structured Web Data Extraction | SWDE | Performance | 91.09 | 120
Long-context Understanding | RULER | Score | 91.1 | 45
Long-context retrieval | Needle-in-a-Haystack | Retrieval Accuracy | 100 | 10
Long-context recall | NIAH Single 2 | Recall @ 32K Context | 0.994 | 4
Long-context recall | NIAH Single-3 | Recall @ 32K Context | 99 | 4
Long-context recall | NIAH Single-1 | Recall @ 32K | 99.8 | 4
