Scaling Transformers for Low-Bitrate High-Quality Speech Coding
About
The tokenization of speech with neural audio codec models is a vital part of modern AI pipelines for the generation or understanding of speech, alone or in a multimodal context. Traditionally, such tokenization models have concentrated on low-parameter-count architectures that use only components with strong inductive biases. In this work we show that by scaling a transformer architecture with a large parameter count to this problem, and applying a flexible Finite Scalar Quantization (FSQ) based bottleneck, it is possible to reach state-of-the-art speech quality at extremely low bitrates of 400 or 700 bits per second. The trained models strongly outperform existing baselines in both objective and subjective tests.
Julian D Parker, Anton Smirnov, Jordi Pons, CJ Carr, Zack Zukowski, Zach Evans, Xubo Liu • 2024
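To make the link between an FSQ bottleneck and a bitrate concrete, the following is a minimal sketch and not the authors' implementation. It uses a simplified FSQ variant (a uniform grid on [-1, 1] after a `tanh` squash, without the straight-through gradient used in training). The function names `fsq_quantize` and `bits_per_second`, the choice of six channels with six levels each, and the 25 frames-per-second token rate are all illustrative assumptions. The abstract does not state the actual configuration; these values were chosen only because they land near the 400 bits-per-second figure.

```python
import math

import numpy as np


def fsq_quantize(z, levels):
    """Simplified FSQ: snap each channel to one of `levels[i]` grid points.

    z has shape (frames, channels); levels has one entry per channel.
    Returns integer indices in [0, L-1] and the quantized values in [-1, 1].
    """
    levels = np.asarray(levels)
    bounded = np.tanh(z)                     # squash every channel into (-1, 1)
    half = (levels - 1) / 2.0                # half-width of the integer grid
    idx = np.round((bounded + 1.0) * half).astype(int)
    codes = idx / half - 1.0                 # map indices back to [-1, 1]
    return idx, codes


def bits_per_second(levels, frames_per_second):
    """Bitrate of a token stream with one FSQ code per frame."""
    bits_per_frame = sum(math.log2(level) for level in levels)
    return bits_per_frame * frames_per_second


# Hypothetical configuration (an assumption, not taken from the abstract).
levels = [6, 6, 6, 6, 6, 6]
latents = np.random.default_rng(0).normal(size=(4, 6)) * 3.0
idx, codes = fsq_quantize(latents, levels)
rate = bits_per_second(levels, frames_per_second=25)
print(f"{rate:.1f} bits per second")
```

Because the codebook is implicit in the per-channel level counts, the bitrate depends only on the levels and the frame rate, so changing either is enough to trade quality against bitrate in this sketch.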
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 11.8 | 833 |
| Speech Reconstruction | LibriTTS clean (test) | PESQ 1.787 | 50 |
| Speech Reconstruction | LibriSpeech (test-clean) | STOI 0.91 | 49 |
| Image Reconstruction | ImageNet | PSNR 24.8198 | 43 |
| Text-to-Speech | Seed-TTS (eval) | WER 10.9 | 39 |
| Audio Reconstruction | MusicDB (test) | -- | 28 |
| Speech Reconstruction | LibriSpeech English (test-clean) | SIM 0.62 | 27 |
| Speech Reconstruction | AISHELL-2 Chinese | SIM 0.45 | 27 |
| Image Reconstruction | COCO (test) | CVU 0.8607 | 24 |
| Audio Reconstruction | LibriSpeech (test-clean test-other) | CVU 0.096 | 21 |
Showing 10 of 21 rows