
Data-Efficient On-Policy Distillation for Automatic Speech Recognition

About

Building competitive automatic speech recognition (ASR) models usually requires large-scale audio supervision, which makes reproduction and specialization expensive. We study Ark-ASR, a 0.6B-parameter audio-conditioned language model trained with 100k hours of speech, and examine whether a strong Qwen-ASR teacher can transfer additional recognition capability through on-policy distillation. Across Mandarin and English ASR benchmarks, the proposed training recipe consistently improves over supervised fine-tuning alone and outperforms the same-scale Qwen3-ASR-0.6B baseline on four of five evaluation sets. This is achieved with only 100k hours of speech, compared with the 20M hours of supervised audio reported for the Qwen3-Omni AuT encoder. The larger Qwen3-ASR-1.7B remains stronger, but the results show that teacher-guided on-policy training can substantially close the gap for compact ASR models under a much smaller audio budget. A support-overlap diagnostic further suggests that the teacher-data stage improves local student-teacher compatibility, matching recent analyses of when on-policy distillation is effective.

Yu Lin, Yiming Wang, Runyuan Cai, Xiaodong Zeng • 2026

Related benchmarks

Task                          Dataset                Result    Rank
Automatic Speech Recognition  LibriSpeech Other      WER 4.56  123
Automatic Speech Recognition  LibriSpeech Clean      WER 2.45  107
Automatic Speech Recognition  AISHELL-1              CER 1.95  55
Automatic Speech Recognition  WenetSpeech (meeting)  --        23
Automatic Speech Recognition  WenetSpeech net        CER 5.39  19
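
The benchmark results above are reported as word error rate (WER) for the English sets and character error rate (CER) for the Mandarin sets. As a reminder of what these metrics measure, here is a minimal sketch of the standard definition: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. The function names are illustrative, and the text normalization is an assumption (whitespace tokenization for WER, spaces removed for CER); the paper's own scoring pipeline is not described on this page and may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (ref token missing in hyp)
                curr[j - 1] + 1,     # insertion (extra hyp token)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate; assumes whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate; assumes spaces are ignored."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(f"WER: {100 * wer('the cat sat on the mat', 'the cat sit on mat'):.2f}%")
    # One substituted character over 6 characters.
    print(f"CER: {100 * cer('今天天气很好', '今天天汽很好'):.2f}%")
```

These functions return a fraction; the values in the table are presumably percentages. Benchmark scores are also usually computed by summing edit distances and reference lengths over the whole test set rather than by averaging per-utterance rates.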
