
The Bridge-Garden Dilemma in LLM Distillation: Why Mixing Hard and Soft Labels Works

About

Knowledge distillation (KD) transfers knowledge from a large teacher model to a smaller student. In language modeling, the student is trained either on tokens sampled from the teacher (hard labels) or on the teacher's full next-token distribution (soft labels). Although soft labels appear strictly richer, we find that mixing hard and soft labels consistently yields better results. Crucially, we show that this gain cannot be explained by closer teacher matching during training. Instead, it comes from reduced exposure bias, the mismatch between training and inference distributions. To explain this phenomenon, we introduce the Bridge-Garden Decomposition theory, which categorizes generation steps into two types: Bridges, where the next token must be exact, and Gardens, where it can be flexible. We show that hard-only KD excels in Bridges by avoiding risky deviations, while soft-only KD preserves diversity in Gardens. A hybrid strategy handles both cases and, as a result, reduces exposure bias across the sequence. Guided by this theory, we develop a family of Bridge-Garden hybrid supervision methods that adaptively balance hard and soft labels. Across a primary suite of seven teacher-student pairs (including Qwen, Llama, Gemma, and DeepSeek) and benchmarks in reasoning and coding, our approach outperforms divergence-based and on-policy KD baselines while reducing training cost by 9.7x, enabling efficient model compression. Code is available at https://github.com/ghwang-s/bridge_garden_hybrid_kd_release.
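
To make the hard/soft distinction concrete, the following is a minimal sketch of a mixed objective at a single generation step. It is not the authors' method: the abstract describes adaptive balancing, whereas this sketch assumes a fixed mixing weight `alpha`, and the function names `log_softmax` and `hybrid_kd_loss` are hypothetical. The soft term is written as cross-entropy against the teacher distribution, which equals the forward KL divergence up to a constant (the teacher's entropy).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def hybrid_kd_loss(student_logits, teacher_probs, sampled_token, alpha=0.5):
    """Mixed hard/soft distillation loss for one position (illustrative only).

    student_logits: student's unnormalized next-token scores.
    teacher_probs:  teacher's full next-token distribution (soft label).
    sampled_token:  index of a token sampled from the teacher (hard label).
    alpha:          assumed fixed weight on the hard-label term.
    """
    logp = log_softmax(student_logits)
    # Hard label: negative log-likelihood of the sampled teacher token.
    hard = -logp[sampled_token]
    # Soft label: cross-entropy against the teacher's full distribution.
    soft = -sum(p * lp for p, lp in zip(teacher_probs, logp))
    return alpha * hard + (1.0 - alpha) * soft


# Usage: a three-token vocabulary where the teacher favors token 0.
loss = hybrid_kd_loss([2.0, 0.5, -1.0], [0.7, 0.2, 0.1], sampled_token=0, alpha=0.5)
```

Setting `alpha=1.0` recovers hard-only KD and `alpha=0.0` recovers soft-only KD, the two extremes the abstract contrasts.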

Guanghui Wang, Kaiwen Lv Kacuila, Zhiyong Yang, Zitai Wang, Jin-Wen Wu, Longtao Huang, Qianqian Xu, Qingming Huang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Multitask Language Understanding | MMLU | Accuracy 35.64 | 520
Logical reasoning | BBH | Accuracy 27.44 | 249
General Reasoning | BBH | Accuracy 35.11 | 190
General Reasoning | MMLU | MMLU Accuracy 61.68 | 180
Code Generation | HumanEval+ (test) | Pass@1 38.41 | 132
Reasoning | Reasoning Benchmarks BBH, MMLU, ARC-C, ThmQA (test) | BBH 46.53 | 66
Reasoning | BBH | BBH Pass@1 7.78 | 49
Code Generation | MBPP+ | Pass@1 51.95 | 40
Code Generation | HumanEval v1 (test) | -- | 37
Code Generation | HumanEval and MBPP EvalPlus | -- | 29
Showing 10 of 23 rows
