Unlocking Large Audio-Language Models for Interactive Language Learning

About

Achieving pronunciation proficiency in a second language (L2) remains a challenge, despite the development of Computer-Assisted Pronunciation Training (CAPT) systems. Traditional CAPT systems often provide unintuitive feedback that lacks actionable guidance, which limits their effectiveness. Recent advances in audio-language models (ALMs) offer the potential to enhance these systems by providing more user-friendly feedback. In this work, we investigate ALMs for chat-based pronunciation training by introducing L2-Arctic-plus, an English dataset with detailed error explanations and actionable suggestions for improvement. We benchmark cascaded ASR+LLM pipelines and existing ALMs on this dataset, specifically on detecting mispronunciation and generating actionable feedback. To improve performance, we further propose instruction-tuning ALMs on L2-Arctic-plus. Experimental results demonstrate that our instruction-tuned models significantly outperform existing baselines on mispronunciation detection and suggestion generation under both objective and human evaluation, highlighting the value of the proposed dataset.

Hongfu Liu, Zhouying Cui, Xiangming Gu, Ye Wang • 2026

Related benchmarks

Task	Dataset	Metric	Result
Suggestion Generation	L2-Arctic-plus (test)	BLEU-2	20.4	8
Mispronunciation Detection	L2-Arctic-plus (test)	Precision	51.6	8
Pronunciation Training Feedback Generation	L2-Arctic-plus Human Evaluation (12 samples)	SR (Suggestion Relevance)	3.8	4
Feedback Generation	L2-Arctic-plus	Average Score	2.328	3
Mispronunciation Detection and Suggestion Generation	L2-Arctic-plus	Win Rate	96.55	2
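
For readers unfamiliar with the Precision metric listed for mispronunciation detection, the following is a minimal sketch of how word-level precision could be computed. It assumes that both the model output and the reference annotation are reduced to sets of flagged words; the function name `detection_precision` and the example words are hypothetical, and the exact scoring protocol used for L2-Arctic-plus is not stated on this page.

```python
def detection_precision(predicted, reference):
    """Fraction of words flagged by the model that the reference also marks.

    Assumption: both arguments are collections of word strings; this is an
    illustrative definition, not the paper's confirmed scoring script.
    """
    predicted, reference = set(predicted), set(reference)
    if not predicted:
        # No words flagged: define precision as 0.0 to avoid division by zero.
        return 0.0
    return len(predicted & reference) / len(predicted)


# Hypothetical utterance: the model flags three words,
# two of which the annotators also marked as mispronounced.
flagged = ["think", "very", "world"]
annotated = ["think", "world", "three"]

print(f"precision = {detection_precision(flagged, annotated):.3f}")
```

Under this definition a model that flags few words can score high precision while missing many errors, so precision is usually read alongside recall.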
