ZIPA: A family of efficient models for multilingual phone recognition
About
We present ZIPA, a family of efficient speech models that advances the state of the art in crosslinguistic phone recognition. We first curated IPAPack++, a large-scale multilingual speech corpus with 17,132 hours of normalized phone transcriptions, together with a novel evaluation set covering unseen languages and sociophonetic variation. Trained on this large-scale data, the ZIPA models, comprising transducer (ZIPA-T) and CTC-based (ZIPA-CR) variants, leverage the efficient Zipformer backbone and outperform existing phone recognition systems with far fewer parameters. Scaling further via noisy student training on 11,000 hours of pseudo-labeled multilingual data yields additional improvements. While ZIPA achieves strong performance on benchmarks, error analysis reveals persistent limitations in modeling sociophonetic diversity, underscoring challenges for future research.
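The noisy student step mentioned above can be sketched as a generic pseudo-labeling loop. This is a toy illustration of the general technique, not ZIPA's training code: the `teacher`, `train`, and `augment` callables below are hypothetical stand-ins.

```python
# Generic noisy-student pseudo-labeling loop (illustrative sketch only).
# A trained "teacher" labels unlabeled audio; a "student" is then trained on
# clean labels plus confident pseudo-labels, with noise/augmentation applied
# to its inputs (e.g. SpecAugment for speech).

def pseudo_label(teacher, unlabeled, threshold=0.9):
    """Keep only pseudo-labels the teacher is confident about."""
    pseudo = []
    for x in unlabeled:
        label, confidence = teacher(x)  # teacher returns (label, confidence)
        if confidence >= threshold:
            pseudo.append((x, label))
    return pseudo

def noisy_student_round(teacher, labeled, unlabeled, train, augment):
    """One round: pseudo-label, merge with labeled data, train the student."""
    pseudo = pseudo_label(teacher, unlabeled)
    # The "noisy" part: inputs are perturbed before the student sees them.
    data = [(augment(x), y) for x, y in labeled + pseudo]
    return train(data)
```

In iterative variants, the trained student becomes the next round's teacher, which is how pseudo-labeling at this scale is typically bootstrapped.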
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Phone Feature Recognition | Buckeye (sociophonetic) | PFER | 3.86 | 25 |
| Phone Feature Recognition | DoReCo (unseen languages) | PFER | 5.8 | 17 |
| Phone Feature Recognition | L2-Standard (sociophonetic) | PFER | 1.68 | 17 |
| Phone Feature Recognition | L2-Perceived (sociophonetic) | PFER | 3.63 | 17 |
| Phone Feature Recognition | VoxAngeles (unseen languages) | PFER | 0.65 | 17 |
| Phone recognition | Seen Languages | English Error Rate (C) | 0.61 | 15 |
| Phone recognition | PRiSM Multilingual Datasets | PFER (DRC) | 16.8 | 12 |
| Phone recognition | PRiSM Accented English Datasets | PFER (Timing) | 13.1 | 12 |
| Phonetic Perception | DRC-SE (DoReCo South-England) | PFER | 0.1712 | 8 |
| Phonetic Perception | L2-ARCTIC | PFER | 8.54 | 8 |
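The PFER values above are phone feature error rates: an edit distance over phone sequences in which substituting one phone for another costs the fraction of articulatory features on which they differ, so near misses (e.g. /b/ for /p/) are penalized less than unrelated substitutions. A minimal sketch, using a made-up three-feature toy table rather than a full feature inventory such as panphon's:

```python
# Illustrative PFER: feature-weighted edit distance, normalized by reference
# length. The feature table is a toy for demonstration, not a real inventory.
TOY_FEATURES = {
    # phone: (voiced, nasal, high) -- invented binary features
    "p": (0, 0, 0),
    "b": (1, 0, 0),
    "m": (1, 1, 0),
    "i": (1, 0, 1),
}

def sub_cost(a: str, b: str) -> float:
    """Fraction of features on which phones a and b disagree."""
    fa, fb = TOY_FEATURES[a], TOY_FEATURES[b]
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)

def pfer(ref: list[str], hyp: list[str]) -> float:
    """Edit distance with feature-weighted substitutions (ins/del cost 1)."""
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # deletion
                d[i][j - 1] + 1.0,  # insertion
                d[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]),
            )
    return d[n][m] / max(n, 1)  # normalize by reference length

# Misrecognizing /b/ as /p/ flips 1 of 3 toy features in a 2-phone reference.
print(pfer(["b", "i"], ["p", "i"]))  # (1/3) / 2 = 0.1666...
```

This is why PFER rewards phonetically close predictions: a plain phone error rate would count the /b/-for-/p/ substitution as a full error (0.5 here) rather than 0.17.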