MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios

About

In this paper, we propose MDCTCodec, an efficient, lightweight end-to-end neural audio codec based on the modified discrete cosine transform (MDCT). The encoder takes the MDCT spectrum of the audio as input and encodes it into a continuous latent code, which is then discretized by a residual vector quantizer (RVQ). The decoder then decodes the MDCT spectrum from the quantized latent code and reconstructs the audio via the inverse MDCT. During training, a novel multi-resolution MDCT-based discriminator (MR-MDCTD) is adopted to distinguish natural from decoded MDCT spectra for adversarial training. Experimental results confirm that, in scenarios with high sampling rates and low bitrates, MDCTCodec exhibits high decoded audio quality, improved training and generation efficiency, and a compact model size compared to baseline codecs. Specifically, MDCTCodec achieves a ViSQOL score of 4.18 at a sampling rate of 48 kHz and a bitrate of 6 kbps on the public VCTK corpus.
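
The signal path described above (MDCT in, RVQ on a latent code, inverse MDCT out) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the sine window, the frame size N = 64, the zero-padding scheme, and the function names (mdct, imdct, rvq_encode) are choices made for this example; the codebooks are random rather than learned; the latent z is random stand-in data; and the neural encoder, decoder, and MR-MDCTD discriminator are omitted entirely.

```python
import numpy as np

def sine_window(N):
    # Length-2N sine window; satisfies w[n]^2 + w[n+N]^2 = 1 (Princen-Bradley).
    return np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))

def mdct_basis(N):
    # Cosine basis of shape (N, 2N): N coefficients per 2N-sample frame.
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))

def mdct(x, N):
    # Forward MDCT with hop N. Assumes len(x) is a multiple of N.
    if len(x) % N != 0:
        raise ValueError("len(x) must be a multiple of N in this sketch")
    # Pad N zeros on each side so every input sample lies in two frames.
    padded = np.concatenate([np.zeros(N), x, np.zeros(N)])
    n_frames = len(padded) // N - 1
    frames = np.stack([padded[i * N : i * N + 2 * N] for i in range(n_frames)])
    return (frames * sine_window(N)) @ mdct_basis(N).T  # (n_frames, N)

def imdct(X, N):
    # Inverse MDCT: synthesis window, then overlap-add with hop N.
    frames = (2.0 / N) * (X @ mdct_basis(N)) * sine_window(N)
    out = np.zeros((X.shape[0] + 1) * N)
    for i, frame in enumerate(frames):
        out[i * N : i * N + 2 * N] += frame
    return out[N:-N]  # drop the padding added in mdct()

def rvq_encode(z, codebooks):
    # Residual VQ: each stage quantizes what the previous stages left over.
    # z: (T, D); codebooks: list of (K, D) arrays (random here, learned in practice).
    residual = z.copy()
    quantized = np.zeros_like(z)
    indices = []
    for cb in codebooks:
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
        idx = dist.argmin(axis=1)
        quantized += cb[idx]
        residual -= cb[idx]
        indices.append(idx)
    return np.stack(indices, axis=1), quantized  # (T, n_stages), (T, D)

rng = np.random.default_rng(0)
N = 64                                   # illustrative frame size
x = rng.standard_normal(8 * N)           # stand-in for an audio signal
X = mdct(x, N)                           # MDCT spectrum, one row per frame
x_hat = imdct(X, N)                      # reconstruction without quantization

z = rng.standard_normal((X.shape[0], 8))                        # hypothetical latent
codebooks = [rng.standard_normal((16, 8)) for _ in range(3)]    # 3 stages, 16 entries
idx, z_q = rvq_encode(z, codebooks)
```

Without quantization, the MDCT followed by the inverse MDCT and overlap-add reproduces the input up to floating-point error, so in a pipeline of this shape the only loss comes from the encoder, quantizer, and decoder. For an RVQ codec the bitrate is generally the frame rate times the number of codebooks times log2 of the codebook size; the sizes used here are illustrative and are not the paper's configuration.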

Xiao-Hang Jiang, Yang Ai, Rui-Chen Zheng, Hui-Peng Du, Ye-Xin Lu, Zhen-Hua Ling • 2024

Related benchmarks

Task	Dataset	Metric	Value
Speech Coding	LibriTTS 16 kHz (test)	GFLOPs	2.28	19
Speech Quality Evaluation	VCTK 48 kHz (test)	STOI	0.866	18
Speech Coding	VCTK 48 kHz (test)	RTF (CPU)	0.142	12
Neural Speech Coding	LibriTTS 16 kHz (test)	STOI	0.912	12
Speech Reconstruction	LibriTTS 16 kHz (test)	ViSQOL	3.45	7
Speech Reconstruction	VCTK 48 kHz (test)	ViSQOL	3.48	6
Speech Coding	LibriTTS 16 kHz (test-clean)	LSD (dB)	0.952	6
Speech Coding	LibriTTS 16 kHz, 0.5 kbps (test)	UTMOS	2.67	5
Speech Coding	VCTK 48 kHz, 1.5 kbps (test)	SIGMOS	2.846	5
