iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform
About
In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly applied as an intermediate representation, and the necessity for a mel-spectrogram vocoder is increasing. A mel-spectrogram vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional mel-spectrogram vocoder solves these problems jointly and implicitly using a convolutional neural network, including temporal upsampling layers, when directly calculating a raw waveform. Such an approach allows skipping redundant processes during waveform synthesis (e.g., the direct reconstruction of high-dimensional original-scale spectrograms). However, this approach solves all problems in a black box and cannot effectively employ the time-frequency structures existing in a mel-spectrogram. We thus propose iSTFTNet, which replaces some output-side layers of the mel-spectrogram vocoder with the inverse short-time Fourier transform (iSTFT) after sufficiently reducing the frequency dimension using upsampling layers, reducing the computational cost from black-box modeling and avoiding redundant estimations of high-dimensional spectrograms. In our experiments, we applied our ideas to three HiFi-GAN variants and made the models faster and more lightweight with reasonable speech quality. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet/.
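The frequency-to-time conversion that iSTFTNet hands to a fixed iSTFT can be illustrated with a minimal, dependency-free sketch. It shows only the final stage: magnitude and phase frames are turned into a waveform by an inverse DFT per frame followed by windowed overlap-add. The function names (`hann`, `irfft`, `istft`, `stft`), the FFT size of 16, and the hop of 4 are assumptions chosen for illustration and are not taken from the abstract; in an actual vocoder the magnitude and phase would come from the network's output layers rather than from a forward STFT, and a library iSTFT would normally be used.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def irfft(spec, n_fft):
    """Inverse DFT of one real frame given its n_fft // 2 + 1 non-negative bins."""
    # Rebuild the negative-frequency bins from conjugate symmetry.
    full = list(spec) + [
        spec[n_fft - k].conjugate() for k in range(n_fft // 2 + 1, n_fft)
    ]
    return [
        sum(full[k] * cmath.exp(2j * math.pi * k * t / n_fft)
            for k in range(n_fft)).real / n_fft
        for t in range(n_fft)
    ]


def istft(magnitude, phase, n_fft=16, hop=4):
    """Windowed overlap-add iSTFT.

    magnitude, phase: lists of frames, each holding n_fft // 2 + 1 values.
    Returns n_fft + hop * (n_frames - 1) samples.
    """
    window = hann(n_fft)
    length = n_fft + hop * (len(magnitude) - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for f, (mag, ph) in enumerate(zip(magnitude, phase)):
        # Combine magnitude and phase into complex bins, then go to time domain.
        frame = irfft([cmath.rect(m, p) for m, p in zip(mag, ph)], n_fft)
        for t in range(n_fft):
            out[f * hop + t] += frame[t] * window[t]
            norm[f * hop + t] += window[t] ** 2
    # Divide by the summed squared window; samples with no window energy stay 0.
    return [o / n if n > 1e-8 else 0.0 for o, n in zip(out, norm)]


def stft(signal, n_fft=16, hop=4):
    """Forward STFT, used here only to produce test input for istft."""
    window = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + t] * window[t] for t in range(n_fft)]
        frames.append([
            sum(seg[t] * cmath.exp(-2j * math.pi * k * t / n_fft)
                for t in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


# Round trip on a toy sinusoid: analyse, split into magnitude/phase, resynthesise.
x = [math.sin(0.3 * n) for n in range(48)]
spec = stft(x)
mag = [[abs(c) for c in frame] for frame in spec]
ph = [[cmath.phase(c) for c in frame] for frame in spec]
y = istft(mag, ph)
```

With these illustrative settings each frame carries only 9 frequency bins, which is the sense in which a model would need to estimate far fewer spectral values per frame than a full-resolution spectrogram. The very first sample is not recoverable in this sketch because the periodic Hann window is zero there.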
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speech Enhancement | Speech Enhancement (SE) Task (test) | PESQ 1.818 | 22 |
| Speech Synthesis | LibriTTS (ID) | PESQ 2.95 | 20 |
| Neural Vocoding | LibriTTS (test) | PESQ 2.88 | 18 |
| Speech Synthesis | AISHELL3 Mandarin | UTMOS 2.351 | 14 |
| Speech Synthesis | Sound Effect (evaluation) | M-STFT 1.54 | 13 |
| Neural Vocoding | LibriTTS | UTMOS 3.564 | 12 |
| Neural Vocoding | LJSpeech 88 (test) | M-STFT 1.188 | 12 |
| Neural Vocoding | LJSpeech 1.1 (test) | M-STFT 1.188 | 12 |
| Singing Voice Synthesis | OpenSinger (ID) | PESQ 3.06 | 9 |
| Singing Voice Synthesis | M4Singer and Opencpop (OD) | PESQ 2.81 | 9 |