
DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

About

In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance on several downstream tasks, and we provide a detailed analysis of the model and the learned discrete units.

Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R. Glass • 2023
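
The abstract implies a simple training loop: a teacher encodes unmasked audio, an online clusterer discretizes the teacher's embeddings into codebook indices, and a student is trained to predict those indices at masked positions, with the teacher refreshed as an exponential moving average (EMA) of the student. The sketch below (assuming PyTorch) illustrates one such step; all names (`student`, `teacher`, `classifier`, `codebook`, `dinosr_step`) are illustrative rather than the authors' code, the batch dimension is omitted, and the single codebook is a simplification of the paper's per-layer codebooks over the top teacher layers.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # Self-distillation: teacher weights track the student via an EMA.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(decay).add_(ps, alpha=1 - decay)

@torch.no_grad()
def online_cluster(embeddings, codebook, decay=0.9):
    # Assign each teacher frame embedding (T, D) to its nearest codeword (K, D),
    # then nudge the selected codewords toward their assigned embeddings (EMA).
    # The codebook indices act as a machine-discovered discrete unit inventory.
    dists = torch.cdist(embeddings, codebook)   # (T, K) pairwise distances
    assignments = dists.argmin(dim=-1)          # (T,) discrete targets
    for k in assignments.unique():
        mean_k = embeddings[assignments == k].mean(dim=0)
        codebook[k].mul_(decay).add_(mean_k, alpha=1 - decay)
    return assignments

def dinosr_step(student, teacher, classifier, codebook, audio, mask):
    # 1) Teacher encodes the *unmasked* audio into contextual embeddings.
    with torch.no_grad():
        targets = teacher(audio)                        # (T, D)
        labels = online_cluster(targets, codebook)      # (T,) cluster ids
    # 2) Student sees the *masked* audio (masked language modeling).
    hidden = student(audio, mask=mask)                  # (T, D)
    # 3) Student predicts the teacher's cluster index at masked frames.
    logits = classifier(hidden[mask])                   # (M, K)
    loss = F.cross_entropy(logits, labels[mask])
    loss.backward()
    # Caller runs optimizer.step(), then ema_update(teacher, student).
    return loss
```

Because the codebook is updated online alongside the teacher, the discrete targets evolve during training without the separate offline clustering passes used by HuBERT-style pipelines.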

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 6.7 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 2.9 | 833 |
| Automatic Speech Recognition | LibriSpeech (dev-other) | WER | 6.4 | 411 |
| Automatic Speech Recognition | LibriSpeech (test-clean) | WER | 4.2 | 84 |
| Speech Recognition | LibriSpeech clean (dev) | WER | 0.023 | 59 |
| Speech Processing | SUPERB (test) | KS Accuracy | 96.69 | 18 |
| Speech Processing | SUPERB | PER | 3.21 | 9 |
| Discrete unit quality evaluation | LibriSpeech 960h | ABX | 7.73 | 9 |
| Spoken Language Modeling | Libri-Light 6k | sWUGGY (all) | 60.1 | 9 |
| Phonetic Discriminability (ABX) | LibriSpeech clean (dev) | ABX (within-speaker) | 4.05 | 7 |

Showing 10 of 16 benchmark results.

Other info

Code
