
SODA: Semi On-Policy Black-Box Distillation for Large Language Models

About

Black-box knowledge distillation for large language models presents a strict trade-off. Simple off-policy methods (e.g., sequence-level knowledge distillation) struggle to correct the student's inherent errors. Fully on-policy methods (e.g., Generative Adversarial Distillation) solve this via adversarial training but introduce well-known training instability and crippling computational overhead. To address this dilemma, we propose SODA (Semi On-policy Distillation with Alignment), a highly efficient alternative motivated by the inherent capability gap between frontier teachers and much smaller base models. Because a compact student model's natural, zero-shot responses are almost strictly inferior to the powerful teacher's targets, we can construct a highly effective contrastive signal simply by pairing the teacher's optimal response with a one-time static snapshot of the student's outputs. This demonstrates that exposing the small student to its own static inferior behaviors is sufficient for high-quality distribution alignment, eliminating the need for costly dynamic rollouts and fragile adversarial balancing. Extensive evaluations across four compact Qwen2.5 and Llama-3 models validate this semi on-policy paradigm. SODA matches or outperforms the state-of-the-art methods on 15 out of 16 benchmark results. More importantly, it achieves this superior distillation quality while training 10 times faster, consuming 27% less peak GPU memory, and completely eliminating adversarial instability.
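The pairing step described in the abstract can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `PreferencePair`, `build_static_pairs`, `teacher_respond`, and `student_respond` are hypothetical, and the abstract does not say which alignment objective consumes the pairs. The only point shown is that each model is queried once per prompt and the resulting pairs stay fixed during training.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class PreferencePair:
    """One static contrastive example (hypothetical container)."""
    prompt: str
    chosen: str    # teacher response, treated as the preferred output
    rejected: str  # student snapshot response, treated as the inferior output


def build_static_pairs(
    prompts: Iterable[str],
    teacher_respond: Callable[[str], str],
    student_respond: Callable[[str], str],
) -> List[PreferencePair]:
    """Query teacher and student once per prompt and freeze the results.

    No rollouts are regenerated later, which is what makes the data
    "semi on-policy": the rejected side comes from the student, but only
    from a one-time snapshot taken before training.
    """
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        chosen = teacher_respond(prompt)
        rejected = student_respond(prompt)
        # Assumption (not from the source): drop pairs whose two sides are
        # identical, since they carry no contrastive signal.
        if chosen == rejected:
            continue
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs
```

In practice `teacher_respond` would wrap a black-box API call and `student_respond` would wrap generation from the untrained student checkpoint. Both are passed in as plain callables here, so the sketch makes no assumption about any particular inference library.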

Xiwen Chen, Jingjing Wang, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hejian Sang, Zhipeng Wang, Alborz Geramifard, Feng Luo • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Instruction Following Evaluation | LMSYS In-Dist. | GPT-4o Score: 51.8 | 17 |
| Instruction Following Evaluation | Dolly Out-of-Distribution | GPT-4o Score: 49.9 | 17 |
| Instruction Following Evaluation | SelfInst Out-of-Distribution | GPT-4o Score: 51.6 | 17 |
| Instruction Following Evaluation | Vicuna Out-of-Distribution | GPT-4o Score: 51.9 | 17 |
