Are discrete units necessary for Spoken Language Modeling?

About

Recent work in spoken language modeling shows that a language can be learned from raw audio in an unsupervised manner, without any text labels. The approach first transforms the audio into a sequence of discrete units (or pseudo-text) and then trains a language model directly on this pseudo-text. Is such a discrete bottleneck necessary, given that it may introduce irreversible errors in the encoding of the speech signal, or could we learn a language model without discrete units at all? In this work, we study the role of discrete versus continuous representations in spoken language modeling. We show that discretization is indeed essential for good results: it removes linguistically irrelevant information from the continuous features, which helps improve language modeling performance. On the basis of this study, we train a language model on the discrete units of the HuBERT features, reaching new state-of-the-art results on the lexical, syntactic and semantic metrics of the Zero Resource Speech Challenge 2021 (Track 1 - Speech Only).
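To make the "discrete units" step concrete, here is a minimal sketch of turning continuous frame-level features into pseudo-text. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how units are obtained, so k-means clustering is assumed here; random vectors stand in for real HuBERT features; and the function names (`fit_kmeans`, `quantize`, `collapse_repeats`), the number of units, and the optional merging of repeated units are all hypothetical choices.

```python
import numpy as np


def quantize(features, centroids):
    """Map each frame to the index of its nearest centroid (its discrete unit)."""
    # Squared Euclidean distances, shape (n_frames, n_units).
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def fit_kmeans(features, n_units, n_iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice of quantizer)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen frames.
    init = rng.choice(len(features), size=n_units, replace=False)
    centroids = features[init].copy()
    for _ in range(n_iters):
        units = quantize(features, centroids)
        for k in range(n_units):
            members = features[units == k]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[k] = members.mean(axis=0)
    return centroids


def collapse_repeats(units):
    """Optionally merge runs of identical consecutive units (assumed step)."""
    units = np.asarray(units)
    if len(units) == 0:
        return units
    keep = np.concatenate(([True], units[1:] != units[:-1]))
    return units[keep]


# Stand-in for continuous speech features: 200 frames of dimension 16.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 16))

centroids = fit_kmeans(frames, n_units=8)
units = quantize(frames, centroids)    # one unit id per frame
pseudo_text = collapse_repeats(units)  # sequence a language model could be trained on
```

The quantizer keeps only the identity of the nearest centroid and discards everything else about each frame, which is the bottleneck the abstract refers to: fine-grained detail is lost, and that loss is irreversible.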

Tu Anh Nguyen, Benoit Sagot, Emmanuel Dupoux • 2022

Related benchmarks

Task	Dataset	Result
Syntactic knowledge evaluation	sBLIMP ZeroResource Challenge 2021 (dev)	Success Rate: 59.9	9
Lexical knowledge evaluation	sWUGGY ZeroResource Challenge 2021 (dev)	Success Rate (All): 70.9	7
Acoustic unit discovery	ZeroSpeech 2021 (dev)	SS Clean Error: 3.26	7

