DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
About
In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance on several downstream tasks, and we provide a detailed analysis of the model and the learned discrete units.
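The teacher/student loop described above can be sketched in a few dozen lines. The following is a minimal, self-contained illustration, not the authors' implementation: the toy frame-wise MLP encoder, the single codebook, and all names (`Codebook`, `make_encoder`, `train_step`, the hyperparameters) are assumptions for demonstration. DinoSR itself uses a Transformer encoder over CNN-extracted audio features and clusters several top teacher layers, each with its own codebook.

```python
# Minimal sketch of a DinoSR-style training step (illustrative, not the paper's code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, NUM_CODES, TAU, DECAY = 64, 8, 0.999, 0.9  # assumed toy hyperparameters

class Codebook:
    """Online clustering: centroids updated by EMA of cluster counts and sums."""
    def __init__(self, num_codes, dim, decay=DECAY):
        self.codes = torch.randn(num_codes, dim)
        self.counts = torch.ones(num_codes)
        self.sums = self.codes.clone()
        self.decay = decay

    @torch.no_grad()
    def assign_and_update(self, x):
        # x: (N, dim) teacher embeddings at masked frames.
        idx = torch.cdist(x, self.codes).argmin(dim=-1)        # nearest centroid
        onehot = F.one_hot(idx, self.codes.size(0)).float()
        # EMA update of per-cluster counts and sums -> refreshed centroids.
        self.counts.mul_(self.decay).add_(onehot.sum(0), alpha=1 - self.decay)
        self.sums.mul_(self.decay).add_(onehot.T @ x, alpha=1 - self.decay)
        self.codes = self.sums / self.counts.clamp(min=1e-6).unsqueeze(1)
        return idx                                             # discrete targets

def make_encoder():
    # Stand-in for the Transformer encoder (a toy frame-wise MLP).
    return nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))

student = make_encoder()
teacher = copy.deepcopy(student).requires_grad_(False)         # EMA copy of student
head = nn.Linear(DIM, NUM_CODES)                               # cluster-index predictor
book = Codebook(NUM_CODES, DIM)
opt = torch.optim.Adam(list(student.parameters()) + list(head.parameters()), 1e-3)

def train_step(frames):                                        # frames: (B, T, DIM)
    mask = torch.rand(frames.shape[:2]) < 0.5                  # random frame mask
    # 1) Teacher sees unmasked audio; 2) online clustering yields discrete units.
    with torch.no_grad():
        targets = book.assign_and_update(teacher(frames)[mask])
    # 3) Student sees masked input and 4) predicts the cluster id at masked frames.
    logits = head(student(frames * ~mask.unsqueeze(-1))[mask])
    loss = F.cross_entropy(logits, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # 5) Teacher tracks the student via exponential moving average.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(TAU).add_(ps, alpha=1 - TAU)
    return loss.item()

print(train_step(torch.randn(4, 50, DIM)))
```

Keeping the clustering step gradient-free (EMA-updated centroids rather than learned codebook weights) is what makes the discretization "online": the phone-like inventory emerges as a by-product of training rather than from a separate offline k-means pass.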
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 6.7 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 2.9 | 833 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER | 6.4 | 411 |
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER | 4.2 | 84 |
| Speech Recognition | LibriSpeech clean (dev) | WER | 0.023 | 59 |
| Speech Processing | Speech Processing Universal PERformance Benchmark (SUPERB) (test) | KS Accuracy | 96.69 | 18 |
| Speech Processing | SUPERB | PER | 3.21 | 9 |
| Discrete unit quality evaluation | LibriSpeech 960h | ABX | 7.73 | 9 |
| Spoken Language Modeling | Libri-Light 6k | sWUGGY (all) | 60.1 | 9 |
| Phonetic Discriminability (ABX) | LibriSpeech clean (dev) | ABX (within-speaker) | 4.05 | 7 |