
Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction

About

Speech bandwidth extension (BWE) refers to widening the frequency bandwidth of speech signals, which makes speech sound brighter and fuller and thereby improves its quality. This paper proposes a generative adversarial network (GAN)-based BWE model with parallel prediction of Amplitude and Phase spectra, named AP-BWE, which achieves both high-quality and efficient wideband speech waveform generation. The AP-BWE generator is built entirely from convolutional neural networks (CNNs). It features a dual-stream architecture with mutual interaction: the amplitude stream and the phase stream communicate with each other while extending the high-frequency components of the input narrowband amplitude and phase spectra, respectively. To improve the naturalness of the extended speech signals, we employ a multi-period discriminator at the waveform level and design a pair of multi-resolution discriminators, one for amplitude and one for phase, at the spectral level. Experimental results demonstrate that AP-BWE achieves state-of-the-art speech quality on BWE tasks targeting sampling rates of both 16 kHz and 48 kHz. In terms of generation efficiency, owing to its all-convolutional architecture and all-frame-level operations, AP-BWE can generate 48 kHz waveform samples 292.3 times faster than real time on a single RTX 4090 GPU and 18.1 times faster than real time on a single CPU. Notably, to our knowledge, AP-BWE is the first method to directly extend the high-frequency phase spectrum, which is beneficial for improving the effectiveness of existing BWE methods.
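The idea described above, treating a waveform as a pair of amplitude and phase spectra that are extended separately and then recombined, can be illustrated with a short NumPy sketch. This is a minimal sketch under stated assumptions, not the AP-BWE implementation: the STFT settings (`N_FFT`, `HOP`), the log floor `EPS`, the ideal low-pass used to simulate a narrowband input, and every function name are hypothetical. It covers only the signal-processing frame around the model (narrowband waveform in, log-amplitude and wrapped-phase spectra out, and the inverse path from an amplitude/phase pair back to a waveform); the two CNN streams and the discriminators are not shown.

```python
import numpy as np

# Hypothetical STFT settings (assumptions, not taken from the paper).
N_FFT = 512   # frame length in samples
HOP = 128     # hop size, i.e. 75% overlap
EPS = 1e-5    # floor inside the log-amplitude (assumption)


def hann(n):
    """Periodic Hann window of length n."""
    return 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n) / n)


def stft(x, n_fft=N_FFT, hop=HOP):
    """Centered STFT: complex array of shape (frames, n_fft // 2 + 1)."""
    win = hann(n_fft)
    xp = np.pad(x, (n_fft // 2, n_fft // 2))
    n_frames = 1 + (len(xp) - n_fft) // hop
    frames = np.stack([xp[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)


def istft(spec, length, n_fft=N_FFT, hop=HOP):
    """Weighted overlap-add inverse of stft(), trimmed back to `length` samples."""
    win = hann(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=-1) * win
    out_len = n_fft + hop * (frames.shape[0] - 1)
    y = np.zeros(out_len)
    wsum = np.zeros(out_len)
    for i, frame in enumerate(frames):
        y[i * hop:i * hop + n_fft] += frame
        wsum[i * hop:i * hop + n_fft] += win ** 2
    start = n_fft // 2
    return y[start:start + length] / np.maximum(wsum[start:start + length], 1e-8)


def simulate_narrowband(x, sr, cutoff_hz):
    """Ideal low-pass at `cutoff_hz` while keeping the target sampling rate `sr`."""
    X = np.fft.rfft(x)
    X[np.fft.rfftfreq(len(x), d=1.0 / sr) > cutoff_hz] = 0.0
    return np.fft.irfft(X, n=len(x))


def to_amp_phase(x):
    """Split a waveform into log-amplitude and wrapped-phase spectra."""
    spec = stft(x)
    return np.log(np.abs(spec) + EPS), np.angle(spec)


def from_amp_phase(log_amp, phase, length):
    """Recombine (predicted) log-amplitude and phase spectra into a waveform."""
    amp = np.maximum(np.exp(log_amp) - EPS, 0.0)
    return istft(amp * np.exp(1j * phase), length)


# Usage: white noise stands in for 0.5 s of 48 kHz speech.
rng = np.random.default_rng(0)
sr = 48_000
x = rng.standard_normal(sr // 2)
# 4 kHz cutoff, as for an 8 kHz-sampled input brought up to 48 kHz.
nb = simulate_narrowband(x, sr, cutoff_hz=4_000)
log_amp, phase = to_amp_phase(nb)
# A BWE model would predict the high-band bins of log_amp and phase here;
# this sketch only checks that the decomposition is invertible.
y = from_amp_phase(log_amp, phase, len(nb))
print(log_amp.shape, float(np.max(np.abs(y - nb))))
```

In this sketch, once both spectra are available a single inverse STFT recovers the waveform, with no iterative phase reconstruction such as Griffin-Lim; that is one plausible reading of why predicting amplitude and phase at the frame level keeps generation cheap.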

Ye-Xin Lu, Yang Ai, Hui-Peng Du, Zhen-Hua Ling • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Audio Super-Resolution | VCTK (in-domain) | LSD | 0.63 | 34
Audio Bandwidth Extension | Diverse (ITU-T P.501, ETSI TS 103 281 Annex E, NTT) (evaluation set) | 2f-Model Score | 35.21 | 28
Bandwidth extension | VCTK 8 kHz to 44.1 kHz (test) | VISQOL | 3.23 | 10
Bandwidth extension | TIMIT 8 kHz to 16 kHz (test) | VISQOL | 2.42 | 10
Audio Bandwidth Extension (16→48 kHz) | VCTK (test) | LSD | 0.74 | 5
Audio Bandwidth Extension (12→48 kHz) | VCTK (test) | LSD | 0.81 | 5
Audio Bandwidth Extension (8→48 kHz) | VCTK (test) | LSD | 0.87 | 5
Speech Reconstruction | English dataset, 500 Hz sampling | LSD | 1.43 | 5
Speech Reconstruction | English dataset, 1 kHz sampling | LSD | 1.31 | 5
Speech Reconstruction | English dataset, 2 kHz sampling | LSD | 1.11 | 5
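Most of the benchmark rows above report log-spectral distance (LSD, lower is better). A common definition averages, over T frames, the root-mean-square difference of log10 power spectra across F frequency bins: LSD = (1/T) Σ_t sqrt((1/F) Σ_f (log10|X(t,f)|² − log10|X̂(t,f)|²)²). The NumPy sketch below implements that definition under assumptions: the STFT size, hop, window and `eps` floor are hypothetical choices, and because benchmark implementations differ in exactly these settings, it is not expected to reproduce the numbers listed above.

```python
import numpy as np


def log_power_spectrogram(x, n_fft=2048, hop=512, eps=1e-8):
    """log10 power spectrogram from a centered, periodic-Hann STFT (assumed settings)."""
    win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)
    xp = np.pad(x, (n_fft // 2, n_fft // 2))
    n_frames = 1 + (len(xp) - n_fft) // hop
    frames = np.stack([xp[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.log10(np.abs(np.fft.rfft(frames, axis=-1)) ** 2 + eps)


def lsd(ref, est, **stft_kwargs):
    """Log-spectral distance: RMS over frequency bins, then mean over frames."""
    diff = log_power_spectrogram(ref, **stft_kwargs) - log_power_spectrogram(est, **stft_kwargs)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=-1))))


# Usage with synthetic signals (not speech, so values are not comparable to the table).
rng = np.random.default_rng(0)
sr = 48_000
ref = rng.standard_normal(sr // 2)

# Ideal low-pass at 4 kHz: a "no extension" baseline that leaves the high band empty.
spec = np.fft.rfft(ref)
spec[np.fft.rfftfreq(len(ref), d=1.0 / sr) > 4_000] = 0.0
narrowband = np.fft.irfft(spec, n=len(ref))

print(lsd(ref, ref))         # 0.0 for identical signals
print(lsd(ref, 10 * ref))    # close to 2.0: a +20 dB gain shifts every bin by about 2 in log10
print(lsd(ref, narrowband))  # large, because the high band is missing
```

Since the metric compares log power bin by bin, a BWE output with the right spectral envelope but wrong fine detail in the high band is still penalized there, which is why LSD is sensitive to how well the missing band is reconstructed.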
