Beyond Acoustic Sparsity and Linguistic Bias: A Prompt-Free Paradigm for Mispronunciation Detection and Diagnosis
About
Mispronunciation Detection and Diagnosis (MDD) requires modeling fine-grained acoustic deviations, but current ASR-derived MDD systems face inherent limitations. In particular, CTC-based models favor sequence-level alignments that neglect transient mispronunciation cues, while explicit canonical priors bias predictions toward the intended targets. To address these bottlenecks, we propose a prompt-free framework that decouples acoustic fidelity from canonical guidance. First, we introduce CROTTC, an acoustic model that enforces monotonic, frame-level alignment to capture pronunciation deviations accurately. Second, we implicitly inject mispronunciation information via the IF strategy, following the knowledge transfer principle. Experiments show that CROTTC-IF achieves an F1-score of 71.77% on L2-ARCTIC and 71.70% on the Iqra'Eval2 leaderboard. Empirical analysis further shows that decoupling acoustics from explicit priors yields robust MDD.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mispronunciation Detection | L2-ARCTIC (test) | F1 Score | 71.77 | 20 |
| Mispronunciation Diagnosis | L2-ARCTIC (test) | EDR | 21.98 | 14 |
| Phoneme Recognition | L2-ARCTIC (test) | Phoneme Error Rate (PER) | 15.42 | 14 |
| Mispronunciation Detection and Diagnosis | ERJ (test) | F1 Score | 89.27 | 6 |
| Mispronunciation Detection and Diagnosis | SO762 (test) | F1 Score | 57.16 | 6 |
| Mispronunciation Detection and Diagnosis | Iqra’Eval2 Leaderboard (test) | F1 Score | 71.7 | 5 |
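
For readers unfamiliar with how a detection F1 score is usually derived in MDD work, the sketch below shows one common convention, in which a mispronounced phone is the positive class. It is an assumption that the reported scores follow this convention; the source does not spell out its scoring protocol. The function name `mdd_f1` and the counts in the usage note are hypothetical and are not taken from the paper.

```python
def mdd_f1(true_rejections: int, false_rejections: int, false_acceptances: int) -> float:
    """Detection F1 with 'mispronounced' as the positive class (assumed convention).

    true_rejections:   mispronounced phones correctly flagged as mispronounced
    false_rejections:  correctly pronounced phones wrongly flagged
    false_acceptances: mispronounced phones the system failed to flag
    """
    flagged = true_rejections + false_rejections            # everything the system flagged
    mispronounced = true_rejections + false_acceptances     # everything actually mispronounced
    if flagged == 0 or mispronounced == 0 or true_rejections == 0:
        return 0.0  # precision or recall is undefined or zero
    precision = true_rejections / flagged
    recall = true_rejections / mispronounced
    return 2 * precision * recall / (precision + recall)
```

With made-up counts, `mdd_f1(50, 25, 25)` gives a precision and recall of 2/3 each, so the F1 score is also 2/3. A system that never flags anything returns 0.0 rather than raising a division error.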