DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
About
In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance on several downstream tasks, and we provide a detailed analysis of the model and the learned discrete units.
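The teacher/student loop described above can be sketched in a few dozen lines. The following is a minimal, self-contained illustration, not the authors' implementation: the toy frame-wise MLP encoder, the single codebook, and all names (`Codebook`, `make_encoder`, `train_step`, the hyperparameters) are assumptions for demonstration. DinoSR itself uses a Transformer encoder over CNN-extracted audio features and clusters several top teacher layers, each with its own codebook.

```python
# Minimal sketch of a DinoSR-style training step (illustrative, not the paper's code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, NUM_CODES, TAU, DECAY = 64, 8, 0.999, 0.9  # assumed toy hyperparameters

class Codebook:
    """Online clustering: centroids updated by EMA of cluster counts and sums."""
    def __init__(self, num_codes, dim, decay=DECAY):
        self.codes = torch.randn(num_codes, dim)
        self.counts = torch.ones(num_codes)
        self.sums = self.codes.clone()
        self.decay = decay

    @torch.no_grad()
    def assign_and_update(self, x):
        # x: (N, dim) teacher embeddings at masked frames.
        idx = torch.cdist(x, self.codes).argmin(dim=-1)        # nearest centroid
        onehot = F.one_hot(idx, self.codes.size(0)).float()
        # EMA update of per-cluster counts and sums -> refreshed centroids.
        self.counts.mul_(self.decay).add_(onehot.sum(0), alpha=1 - self.decay)
        self.sums.mul_(self.decay).add_(onehot.T @ x, alpha=1 - self.decay)
        self.codes = self.sums / self.counts.clamp(min=1e-6).unsqueeze(1)
        return idx                                             # discrete targets

def make_encoder():
    # Stand-in for the Transformer encoder (a toy frame-wise MLP).
    return nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))

student = make_encoder()
teacher = copy.deepcopy(student).requires_grad_(False)         # EMA copy of student
head = nn.Linear(DIM, NUM_CODES)                               # cluster-index predictor
book = Codebook(NUM_CODES, DIM)
opt = torch.optim.Adam(list(student.parameters()) + list(head.parameters()), 1e-3)

def train_step(frames):                                        # frames: (B, T, DIM)
    mask = torch.rand(frames.shape[:2]) < 0.5                  # random frame mask
    # 1) Teacher sees unmasked audio; 2) online clustering yields discrete units.
    with torch.no_grad():
        targets = book.assign_and_update(teacher(frames)[mask])
    # 3) Student sees masked input and 4) predicts the cluster id at masked frames.
    logits = head(student(frames * ~mask.unsqueeze(-1))[mask])
    loss = F.cross_entropy(logits, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # 5) Teacher tracks the student via exponential moving average.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(TAU).add_(ps, alpha=1 - TAU)
    return loss.item()

print(train_step(torch.randn(4, 50, DIM)))
```

Keeping the clustering step gradient-free (EMA-updated centroids rather than learned codebook weights) is what makes the discretization "online": the phone-like inventory emerges as a by-product of training rather than from a separate offline k-means pass.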
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 6.7 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 2.9 | 833 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER | 6.4 | 411 |
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER | 4.2 | 84 |
| Speech Recognition | LibriSpeech clean (dev) | WER | 0.023 | 59 |
| Speech Processing | Speech Processing Universal PERformance Benchmark (SUPERB) (test) | KS Accuracy | 96.69 | 18 |
| Speech Processing | SUPERB | PER | 3.21 | 9 |
| Discrete unit quality evaluation | LibriSpeech 960h | ABX | 7.73 | 9 |
| Spoken Language Modeling | Libri-Light 6k | sWUGGY (all) | 60.1 | 9 |
| Phonetic Discriminability (ABX) | LibriSpeech clean (dev) | ABX (within-speaker) | 4.05 | 7 |