A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

About

Training high-performing Small Language Models (SLMs) remains costly, even with knowledge distillation and pruning from larger teacher models. Existing work often faces three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning, by compressing teacher weights, and activation clone, by aligning student activations (including FFN signals) with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers (e.g., Llama-3.2-3B-Instruct, Qwen2.5-3B/7B-Instruct) show that LRC matches or surpasses state-of-the-art models trained on trillions of tokens while using only 20B tokens, achieving over 1,000x training efficiency. Our code and model checkpoints are available at https://github.com/CURRENTF/LowRankClone and https://huggingface.co/collections/JitaiHao/low-rank-clone-lrc-6828389e96a93f1d4219dfaf.
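
The two roles of the projection matrices described above can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names `soft_prune` and `clone_loss`, the tensor shapes, the use of a single FFN down-projection, and the mean-squared-error objective are all assumptions made for illustration. The sketch only shows how one projection matrix could both compress a frozen teacher weight and map teacher activations into the student's width, so that no separate alignment module is needed.

```python
import numpy as np


def soft_prune(w_teacher, proj):
    """Compress a frozen teacher weight (d_in, d_teacher) to (d_in, d_student)."""
    return w_teacher @ proj


def clone_loss(student_act, teacher_act, proj):
    """MSE between student activations and teacher activations that are
    mapped into the student width with the same projection (assumed loss)."""
    return float(np.mean((student_act - teacher_act @ proj) ** 2))


rng = np.random.default_rng(0)
d_ffn, d_teacher, d_student, n_tokens = 128, 64, 16, 8  # toy sizes (assumed)

# Frozen teacher FFN down-projection and a trainable low-rank projection.
w_down_teacher = rng.standard_normal((d_ffn, d_teacher))
proj = rng.standard_normal((d_teacher, d_student)) * 0.1

# Soft pruning: the student weight is generated from the teacher weight.
w_down_student = soft_prune(w_down_teacher, proj)  # shape (d_ffn, d_student)

# Toy FFN activations fed to both models (in real training these differ).
ffn_hidden = rng.standard_normal((n_tokens, d_ffn))
teacher_out = ffn_hidden @ w_down_teacher  # shape (n_tokens, d_teacher)
student_out = ffn_hidden @ w_down_student  # shape (n_tokens, d_student)

# Activation clone: compare student outputs with projected teacher outputs.
loss = clone_loss(student_out, teacher_out, proj)
```

Because both models receive identical inputs in this toy setup, the loss is zero up to floating-point error. In actual training the student's inputs would drift from the teacher's, and a loss of this kind would supply the gradient that updates the projection.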

Jitai Hao, Qiang Huang, Hao Liu, Xinyan Xiao, Zhaochun Ren, Jun Yu • 2025

Related benchmarks

Task | Dataset | Result | Rank
Language Modeling | WikiText-103 (test) | Perplexity: 18.3 | 703
Language Modeling | PTB (test) | Perplexity: 35.2 | 543
General Language Understanding | General Ability Suite (MMLU, PIQA, ARC-E, ARC-C, BoolQ, WinoGrande, HellaSwag, TruthfulQA) | MMLU Accuracy: 65 | 20
Multi-step Reasoning (Math & Code) | Multi-step Reasoning Suite (GSM8K, MBPP, HumanEval) | GSM8K Accuracy: 34 | 20
