Universal Automatic Phonetic Transcription into the International Phonetic Alphabet
About
This paper presents a state-of-the-art model for transcribing speech in any language into the International Phonetic Alphabet (IPA). Transcription of spoken languages into IPA is an essential yet time-consuming process in language documentation, and even partially automating this process has the potential to drastically speed up the documentation of endangered languages. Like the previous best speech-to-IPA model (Wav2Vec2Phoneme), our model is based on wav2vec 2.0 and is fine-tuned to predict IPA from audio input. We use training data from seven languages from CommonVoice 11.0, transcribed into IPA semi-automatically. Although this training dataset is much smaller than Wav2Vec2Phoneme's, its higher quality lets our model achieve comparable or better results. Furthermore, we show that the quality of our universal speech-to-IPA models is close to that of human annotators.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Phone Feature Recognition | Buckeye (sociophonetic) | PFER 5.94 | 25 |
| Phone Feature Recognition | VoxAngeles unseen languages | PFER 0.62 | 17 |
| Phone Feature Recognition | Doreco (unseen languages) | PFER 6.55 | 17 |
| Phone Feature Recognition | L2-Standard (sociophonetic) | PFER 5.86 | 17 |
| Phone Feature Recognition | L2-Perceived sociophonetic | PFER 5.88 | 17 |
| Phone recognition | Seen Languages | English Error Rate (C) 11.26 | 15 |
| Phone recognition | PRiSM Multilingual Datasets | PFER (DRC) 18.3 | 12 |
| Phone recognition | PRiSM Accented English Datasets | PFER (Timing) 16.3 | 12 |
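
The benchmark results above are error rates computed between a predicted and a reference phone sequence. As a rough illustration of how such a score can be obtained, the sketch below computes a plain phone error rate: the Levenshtein edit distance between the two sequences, divided by the reference length. This is not the PFER metric reported in the table, which this page does not define; the function names `edit_distance` and `phone_error_rate` and the example transcriptions are hypothetical, and every substitution is assumed to cost 1 regardless of how similar the two phones are.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phone symbols."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1  # assumption: uniform substitution cost
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phone
                curr[j - 1] + 1,     # insertion of a hypothesis phone
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(ref, hyp):
    """Edit distance normalised by the number of reference phones."""
    if not ref:
        raise ValueError("reference must contain at least one phone")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substitution (æ -> a) and one insertion (s).
reference = ["k", "æ", "t"]
hypothesis = ["k", "a", "t", "s"]
per = phone_error_rate(reference, hypothesis)
```

Here the two errors against three reference phones give a rate of 2/3. A feature-based metric would instead charge a substitution in proportion to how many phonetic features differ between the two phones, so near-misses such as æ versus a would be penalised less than in this sketch.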