Data-Efficient On-Policy Distillation for Automatic Speech Recognition

About

Building competitive automatic speech recognition (ASR) models usually requires large-scale audio supervision, which makes reproduction and specialization expensive. We study Ark-ASR, a 0.6B-parameter audio-conditioned language model trained with 100k hours of speech, and examine whether a strong Qwen-ASR teacher can transfer additional recognition capability through on-policy distillation. Across Mandarin and English ASR benchmarks, the proposed training recipe consistently improves over supervised fine-tuning alone and outperforms the same-scale Qwen3-ASR-0.6B baseline on four of five evaluation sets. This is achieved with only 100k hours of speech, compared with the 20M hours of supervised audio reported for the Qwen3-Omni AuT encoder. The larger Qwen3-ASR-1.7B remains stronger, but the results show that teacher-guided on-policy training can substantially close the gap for compact ASR models under a much smaller audio budget. A support-overlap diagnostic further suggests that the teacher-data stage improves local student-teacher compatibility, matching recent analyses of when on-policy distillation is effective.
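
As a rough illustration of what "on-policy distillation" means in general, the sketch below samples a token sequence from the student and scores each step with a per-token reverse KL divergence against the teacher. This is a minimal toy under stated assumptions, not the recipe used for Ark-ASR: the abstract does not specify the loss, so reverse KL is an assumed (though common) choice, and `student_step`, `teacher_step`, and `on_policy_distill_loss` are hypothetical names. A real ASR setup would also condition both models on the audio, which is omitted here.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) over one next-token distribution."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
    )


def on_policy_distill_loss(student_step, teacher_step, max_len, eos_id, rng):
    """Sample a sequence from the student and average the per-step reverse KL.

    student_step / teacher_step are hypothetical callables that map a token
    prefix (list of ints) to next-token logits over the same vocabulary.
    """
    prefix, losses = [], []
    for _ in range(max_len):
        s_probs = softmax(student_step(prefix))
        t_probs = softmax(teacher_step(prefix))
        # The teacher is queried on the student's own prefix (on-policy).
        losses.append(reverse_kl(s_probs, t_probs))
        # The next token is drawn from the student, not from the teacher.
        token = rng.choices(range(len(s_probs)), weights=s_probs)[0]
        prefix.append(token)
        if token == eos_id:
            break
    return sum(losses) / len(losses), prefix


# Toy usage: a 4-token vocabulary with fixed, prefix-independent logits.
student = lambda prefix: [0.1, 0.2, 0.3, 0.4]
teacher = lambda prefix: [2.0, 0.0, 0.0, 0.0]
loss, tokens = on_policy_distill_loss(student, teacher, 5, 3, random.Random(0))
```

The point of the sketch is the sampling source: the prefixes come from the student, so the teacher's feedback lands on the states the student actually visits, in contrast to supervised fine-tuning on fixed reference transcripts.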

Yu Lin, Yiming Wang, Runyuan Cai, Xiaodong Zeng • 2026

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech Other	WER 4.56	140
Automatic Speech Recognition	LibriSpeech Clean	WER 2.45	124
Automatic Speech Recognition	AISHELL-1	CER 1.95	55
Automatic Speech Recognition	WenetSpeech (meeting)	--	23
Automatic Speech Recognition	WenetSpeech net	CER 5.39	19
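
For readers unfamiliar with the metrics in the table, word error rate (WER) and character error rate (CER) are both edit distance divided by reference length, computed over words and characters respectively. The sketch below is a minimal illustration with hypothetical function names. It applies no text normalization (casing, punctuation, numerals), which real benchmark scoring typically does, so it would not reproduce the numbers above exactly. The table values are presumably percentages, while the functions here return fractions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


english_wer = wer("the cat sat", "the cat sit")  # one substitution in three words
mandarin_cer = cer("你好世界", "你好视界")          # one substitution in four characters
```

CER is the usual choice for Mandarin sets such as AISHELL-1 and WenetSpeech because written Chinese has no whitespace word boundaries, while WER is standard for English sets such as LibriSpeech.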

