K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function
About
Evaluating young children's language is challenging for automatic speech recognizers due to high-pitched voices, prolonged sounds, and limited data. We introduce K-Function, a framework that combines accurate sub-word transcription with objective, Large Language Model (LLM)-driven scoring. Its core, the Kids-Weighted Finite State Transducer (K-WFST), merges an acoustic phoneme encoder with a phoneme-similarity model to capture child-specific speech errors while remaining fully interpretable. K-WFST achieves a 1.39% phoneme error rate (PER) on MyST and 8.61% on Multitudes, absolute improvements of 10.47% and 7.06%, respectively, over a greedy-search decoder. These high-quality transcripts are used by an LLM to grade verbal skills, developmental milestones, reading, and comprehension, with results that align closely with human evaluators. Our findings show that precise phoneme recognition is essential for creating an effective assessment framework, enabling scalable language screening for children.
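The phoneme error rate quoted above is conventionally computed as the token-level edit distance between reference and hypothesis phoneme sequences, divided by the total number of reference phonemes. The sketch below illustrates that standard definition only; it is not the authors' evaluation code, and the function names and the toy ARPAbet-style sequences are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # drop a reference phoneme
            insertion = curr[j - 1] + 1  # extra hypothesis phoneme
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Corpus-level PER: total edits divided by total reference phonemes."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference set contains no phonemes")
    return total_edits / total_ref


# Hypothetical toy data: "cat" recognized with a final /D/ instead of /T/.
refs = [["K", "AE", "T"], ["D", "AO", "G"]]
hyps = [["K", "AE", "D"], ["D", "AO", "G"]]
per = phoneme_error_rate(refs, hyps)  # one substitution over six phonemes
```

If the reported improvements are read as percentage-point differences, the figures above imply greedy-search baselines of roughly 11.86% PER on MyST and 15.67% on Multitudes.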
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| LLM-Assisted Scoring | Multitudes ORF corpus 1.87-hour (test) | MAE 0.0843 | 6 |
| Phoneme Recognition | MyST (test) | PER 0.0139 | 6 |
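
For reading the table: PER is given as a fraction, so 0.0139 corresponds to the 1.39% quoted in the abstract. MAE is the mean absolute error, which here presumably measures the gap between LLM-assigned and human-assigned scores. The sketch below shows the standard MAE definition; the score scale and the sample values are hypothetical, since the listing does not specify them.

```python
def mean_absolute_error(human_scores, model_scores):
    """Average absolute gap between paired human and model scores."""
    if len(human_scores) != len(model_scores) or not human_scores:
        raise ValueError("score lists must be non-empty and equal in length")
    gaps = (abs(h - m) for h, m in zip(human_scores, model_scores))
    return sum(gaps) / len(human_scores)


# Hypothetical scores on an assumed 0-1 scale.
mae = mean_absolute_error([0.5, 1.0], [0.25, 1.0])
```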