Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment

About

Automatic pronunciation assessment is an important technology for helping self-directed language learners. While pronunciation quality has multiple aspects, including accuracy, fluency, completeness, and prosody, previous efforts typically model only one aspect (e.g., accuracy) at one granularity (e.g., the phoneme level). In this work, we explore modeling multi-aspect pronunciation assessment at multiple granularities. Specifically, we train a Goodness of Pronunciation (GOP) feature-based Transformer (GOPT) with multi-task learning. Experiments show that GOPT achieves the best results on speechocean762 with a public automatic speech recognition (ASR) acoustic model trained on Librispeech.
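A minimal sketch of phone-level GOP features, assuming frame-level phone posteriors from an ASR acoustic model and a forced alignment are already available. It follows the common log phone posterior (LPP) and log posterior ratio (LPR) formulation. The function names, the `EPS` floor, and the LPR sign convention are illustrative assumptions, not the authors' code or the Kaldi implementation.

```python
import math
from typing import List, Sequence

# Floor for posteriors so log() never sees zero (illustrative choice).
EPS = 1e-10


def log_phone_posterior(segment: Sequence[Sequence[float]], phone: int) -> float:
    """LPP: mean over the aligned frames of log P(phone | frame).

    `segment` holds one posterior vector per frame, covering only the
    frames that forced alignment assigned to the canonical phone.
    """
    if not segment:
        raise ValueError("segment must contain at least one frame")
    total = sum(math.log(max(frame[phone], EPS)) for frame in segment)
    return total / len(segment)


def gop_features(segment: Sequence[Sequence[float]], canonical: int) -> List[float]:
    """Concatenate LPP for every phone with LPR relative to the canonical phone.

    LPR(q) = LPP(q) - LPP(canonical); this sign convention is an assumption.
    """
    num_phones = len(segment[0])
    lpp = [log_phone_posterior(segment, q) for q in range(num_phones)]
    lpr = [lpp[q] - lpp[canonical] for q in range(num_phones)]
    return lpp + lpr


# Toy posteriors over a 3-phone inventory for a 2-frame segment (made-up values).
segment = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
features = gop_features(segment, canonical=0)
print(len(features))  # 6: three LPP values followed by three LPR values
```

One vector like this per canonical phone yields a phone-level feature sequence, which is the kind of input a GOP feature-based Transformer would consume; the real phone inventory and acoustic model would be much larger than this toy example.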

Yuan Gong, Ziyi Chen, Iek-Heng Chu, Peng Chang, James Glass • 2022

Related benchmarks

Task	Dataset	Metric	Result	#
Pronunciation Assessment	Speechocean762 (test)	Utterance Acc (PCC)	74	30
Phoneme Pronunciation Assessment	speechocean762 official (test)	PCC	0.679	24
Utterance-level Pronunciation Assessment	Speechocean762	PCC (Total)	0.742	9
Word-level Pronunciation Assessment	Speechocean762	PCC (Total)	0.549	7
Sentence Accuracy Assessment	SO762	PCC	0.71	5
Sentence Fluency Assessment	SO762	PCC	0.75	5
Sentence Prosody Assessment	SO762	PCC	0.76	5
Word Accuracy Assessment	SO762	PCC	0.53	5
Phoneme Accuracy Assessment	SO762	PCC	0.61	4
Utterance Pronunciation Assessment	speechocean762 official (test)	Total Score	74.2	4
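Most results above are reported as PCC, the Pearson correlation coefficient between predicted scores and human-annotated scores. A minimal stdlib-only sketch of that metric follows; the function name `pearson_cc` and the toy score lists are illustrative assumptions, not speechocean762 data.

```python
import math
from typing import Sequence


def pearson_cc(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Pearson correlation between predicted and reference (human) scores."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    # Covariance and variances, each left unnormalized; the 1/n factors cancel.
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_r)


# Toy example: predictions that rise linearly with the human scores.
print(pearson_cc([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```

Because PCC measures linear agreement rather than absolute error, a model whose scores are consistently offset or scaled relative to the human labels can still reach a high PCC.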

