
ZIPA: A family of efficient models for multilingual phone recognition

About

We present ZIPA, a family of efficient speech models that advances the state of the art in crosslinguistic phone recognition. We first curated IPAPack++, a large-scale multilingual speech corpus with 17,132 hours of normalized phone transcriptions and a novel evaluation set capturing unseen languages and sociophonetic variation. Trained on this large-scale data, the ZIPA models, in transducer (ZIPA-T) and CTC-based (ZIPA-CR) variants, leverage efficient Zipformer backbones and outperform existing phone recognition systems with far fewer parameters. Scaling further via noisy student training on 11,000 hours of pseudo-labeled multilingual data yields additional improvements. While ZIPA achieves strong performance on benchmarks, error analysis reveals persistent limitations in modeling sociophonetic diversity, underscoring challenges for future research.
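The benchmarks below report PFER, a phone feature error rate: roughly, an edit distance in which substitution cost reflects how many articulatory features two phones disagree on. A minimal sketch of that idea, using a toy three-feature inventory (the real metric uses a full IPA feature table, e.g. via a library such as panphon; the phones and features here are illustrative only):

```python
# Sketch of a feature-weighted phone error rate (PFER-style metric).
# Toy feature table: (voice, nasal, continuant). Illustrative only,
# not the feature inventory used by ZIPA / IPAPack++.
FEATURES = {
    "p": (0, 0, 0),
    "b": (1, 0, 0),
    "m": (1, 1, 0),
    "s": (0, 0, 1),
    "z": (1, 0, 1),
}

def feature_dist(a: str, b: str) -> float:
    """Normalized Hamming distance between two phones' feature vectors."""
    fa, fb = FEATURES[a], FEATURES[b]
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)

def pfer(ref: list[str], hyp: list[str]) -> float:
    """Feature-weighted Levenshtein distance between phone sequences,
    normalized by reference length (one common convention).
    Insertions/deletions cost 1; substitutions cost feature_dist."""
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + feature_dist(ref[i - 1], hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1.0, d[i][j - 1] + 1.0)
    return d[n][m] / max(n, 1)
```

Under this scheme, confusing [p] with [b] (one feature apart) is penalized less than confusing [p] with [z], which is the intuition behind reporting PFER rather than a flat phone error rate.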

Jian Zhu, Farhan Samir, Eleanor Chodroff, David R. Mortensen • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Phone Feature Recognition | Buckeye (sociophonetic) | PFER | 3.86 | 25
Phone Feature Recognition | DoReCo (unseen languages) | PFER | 5.8 | 17
Phone Feature Recognition | L2-Standard (sociophonetic) | PFER | 1.68 | 17
Phone Feature Recognition | L2-Perceived (sociophonetic) | PFER | 3.63 | 17
Phone Feature Recognition | VoxAngeles (unseen languages) | PFER | 0.65 | 17
Phone Recognition | Seen Languages | English Error Rate (C) | 0.61 | 15
Phone Recognition | PRiSM Multilingual Datasets | PFER (DRC) | 16.8 | 12
Phone Recognition | PRiSM Accented English Datasets | PFER (Timing) | 13.1 | 12
Phonetic Perception | DRC-SE (DoReCo South-England) | PFER | 0.1712 | 8
Phonetic Perception | L2-ARCTIC | PFER | 8.54 | 8

Showing 10 of 14 rows
