Unlocking Large Audio-Language Models for Interactive Language Learning

About

Achieving pronunciation proficiency in a second language (L2) remains a challenge, despite the development of Computer-Assisted Pronunciation Training (CAPT) systems. Traditional CAPT systems often provide unintuitive feedback that lacks actionable guidance, which limits their effectiveness. Recent advances in audio-language models (ALMs) offer the potential to enhance these systems by providing more user-friendly feedback. In this work, we investigate ALMs for chat-based pronunciation training by introducing L2-Arctic-plus, an English dataset with detailed error explanations and actionable suggestions for improvement. We benchmark cascaded ASR+LLMs and existing ALMs on this dataset, specifically on detecting mispronunciation and generating actionable feedback. To improve performance, we further propose instruction-tuning ALMs on L2-Arctic-plus. Experimental results demonstrate that our instruction-tuned models significantly outperform existing baselines on mispronunciation detection and suggestion generation under both objective and human evaluation, highlighting the value of the proposed dataset.

Hongfu Liu, Zhouying Cui, Xiangming Gu, Ye Wang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Suggestion Generation | L2-Arctic-plus (test) | BLEU-2: 20.4 | 8
Mispronunciation Detection | L2-Arctic-plus (test) | Precision: 51.6 | 8
Pronunciation Training Feedback Generation | L2-Arctic-plus Human Evaluation (12 samples) | SR (Suggestion Relevance): 3.8 | 4
Feedback Generation | L2-Arctic-plus | Average Score: 2.328 | 3
Mispronunciation Detection and Suggestion Generation | L2-Arctic-plus | Win Rate: 96.55 | 2
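
The precision figure reported for mispronunciation detection can be illustrated with a minimal sketch. It assumes that model predictions and reference annotations are both given as collections of word positions within an utterance. That representation, the function name, and the example values are hypothetical and are not taken from the paper, whose exact scoring procedure may differ.

```python
def detection_precision(predicted, reference):
    """Return the fraction of flagged words that are truly mispronounced.

    predicted: word positions the model flagged as mispronounced (assumed format)
    reference: word positions annotated as mispronounced (assumed format)
    """
    predicted, reference = set(predicted), set(reference)
    if not predicted:
        # Convention assumed here: with nothing flagged, report 0.0.
        return 0.0
    return len(predicted & reference) / len(predicted)


# Hypothetical utterance: the model flags four words, two of which are annotated errors.
flagged = [1, 4, 7, 9]
annotated = [4, 7, 8]
print(detection_precision(flagged, annotated))  # 0.5
```

Precision counts only how many flagged words are correct. It does not penalize missed errors such as position 8 in this example, so it is usually read alongside recall.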