SelecTKD: Selective Token-Weighted Knowledge Distillation for LLMs

About

Knowledge distillation (KD) is a standard route to compress Large Language Models (LLMs) into compact students, yet most pipelines uniformly apply token-wise loss regardless of teacher confidence. This indiscriminate supervision amplifies noisy, high-entropy signals and is especially harmful under large teacher-student capacity gaps. We introduce SelecTKD, a plug-and-play Selective Token-Weighted distillation framework that shifts the focus from "how to measure divergence" to "where to apply learning". At each step, the student proposes tokens that are verified by the teacher through a robust propose-and-verify procedure with two variants: greedy Top-k and non-greedy Spec-k. Accepted tokens receive full loss, while rejected tokens are masked or down-weighted. This objective-agnostic design works with on- and off-policy data, induces an implicit curriculum quantified by Token Acceptance Rate (TAR), and stabilizes optimization. Across instruction following, mathematical reasoning, code generation, and a VLM setting, SelecTKD consistently improves strong baselines and achieves state-of-the-art results for small models without architectural changes or extra reference models.
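The selective weighting described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the acceptance rule (the student's greedy token must lie in the teacher's top-k set), the forward-KL divergence, the normalization by total weight, and all function and parameter names (`selective_kd_loss`, `reject_weight`, and so on) are assumptions made for illustration. Only the greedy Top-k variant is sketched; Spec-k is not covered.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_ids(probs, k):
    """Indices of the k highest-probability tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])


def forward_kl(p, q):
    """KL(p || q); stands in for whichever divergence the base objective uses."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def selective_kd_loss(teacher_logits, student_logits, k=2, reject_weight=0.0):
    """Return (weighted loss, token acceptance rate) for one sequence.

    Assumed rule: a position is accepted when the student's greedy token is
    among the teacher's top-k tokens. Accepted positions get weight 1.0;
    rejected positions get reject_weight (0.0 masks them entirely).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    accepted_count = 0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p_teacher = softmax(t_row)
        p_student = softmax(s_row)
        proposal = max(range(len(p_student)), key=lambda i: p_student[i])
        accepted = proposal in top_k_ids(p_teacher, k)
        weight = 1.0 if accepted else reject_weight
        accepted_count += int(accepted)
        weighted_sum += weight * forward_kl(p_teacher, p_student)
        weight_total += weight
    # Assumption: normalize by total weight; fall back to 0.0 if all masked.
    loss = weighted_sum / weight_total if weight_total > 0 else 0.0
    tar = accepted_count / len(teacher_logits)
    return loss, tar


# Toy example: two positions, vocabulary of three tokens.
teacher = [[4.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
student = [[3.0, 0.5, 0.0], [2.0, 0.0, 0.1]]
loss, tar = selective_kd_loss(teacher, student, k=1)
print(f"loss={loss:.4f} TAR={tar:.2f}")
```

In this toy input the student agrees with the teacher's top token at the first position only, so half of the tokens are accepted and the masked loss comes from that position alone. Setting `reject_weight` to a small positive value would down-weight rejected tokens instead of masking them.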

Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Logical reasoning | ZebraLogic | Accuracy | 72.1 | 54
Mathematical Reasoning | HMMT 25 | Accuracy (HMMT 25) | 33.54 | 50
Instruction Following | IF-Eval | Accuracy | 49.17 | 14
Knowledge Reasoning | GPQA Diamond | Accuracy | 58.46 | 12
Coding | LCB v6 | Pass@1 | 29.71 | 6
Coding | LCB v5 | Pass@1 | 54.48 | 6
Preference-based Generation | Arena CW | Score | 35.6 | 6
General Evaluation | LiveBench 1125 | Score | 41.8 | 6
Medical Reasoning QA | MedQA USMLE | Accuracy | 85.78 | 6
Medical Reasoning QA | MedXpertQA text | Accuracy | 23.51 | 6
Showing 10 of 11 rows
