QuarkAudio Technical Report

About

Many existing audio processing and generation models rely on task-specific architectures, resulting in fragmented development efforts and limited extensibility. It is therefore promising to design a unified framework capable of handling multiple tasks, while providing robust instruction and audio understanding and high-quality audio generation. This requires a compatible paradigm design, a powerful backbone, and a high-fidelity audio reconstruction module. To meet these requirements, this technical report introduces QuarkAudio, a decoder-only autoregressive (AR) LM-based generative framework that unifies multiple tasks. The framework includes a unified discrete audio tokenizer, H-Codec, which incorporates self-supervised learning (SSL) representations into the tokenization and reconstruction process. We further propose several improvements to H-Codec, such as a dynamic frame-rate mechanism and an extension of the audio sampling rate to 48 kHz. QuarkAudio unifies tasks by using task-specific conditional information as the conditioning sequence of the decoder-only LM and predicting discrete target audio tokens in an AR manner. The framework supports a wide range of audio processing and generation tasks, including speech restoration (SR), target speaker extraction (TSE), speech separation (SS), voice conversion (VC), and language-queried audio source separation (LASS). In addition, we extend downstream tasks to universal free-form audio editing guided by natural language instructions (including speech semantic editing and audio event editing). Experimental results show that H-Codec achieves high-quality audio reconstruction at a low frame rate, improving both the efficiency and performance of downstream audio generation, and that QuarkAudio delivers performance competitive with or comparable to state-of-the-art task-specific or multi-task systems across multiple tasks.
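The unification idea described above (task-specific conditions placed in the prefix of a decoder-only LM, with target audio tokens predicted one at a time) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the token ids, the vocabulary layout, `build_prefix`, `generate`, and the toy `toy_next_token` stand-in for the language model are hypothetical and are not taken from the report. Greedy decoding is used only for simplicity; the report does not state the sampling strategy.

```python
from typing import Callable, List, Sequence

# Hypothetical special-token ids; the report does not specify a vocabulary layout.
TASK_IDS = {"SR": 0, "TSE": 1, "SS": 2, "VC": 3, "LASS": 4}
SEP, EOS = 5, 6
AUDIO_OFFSET = 7  # assumed: codec token ids are shifted past the special tokens


def build_prefix(task: str, condition_tokens: Sequence[int]) -> List[int]:
    """Join a task tag and the task-specific condition tokens into one prefix."""
    return [TASK_IDS[task], *(AUDIO_OFFSET + t for t in condition_tokens), SEP]


def generate(
    prefix: Sequence[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int = 16,
) -> List[int]:
    """Greedy AR decoding: feed the growing sequence back until EOS or the cap."""
    seq = list(prefix)
    out: List[int] = []
    for _ in range(max_new_tokens):
        tok = next_token(seq)
        if tok == EOS:
            break
        seq.append(tok)
        out.append(tok - AUDIO_OFFSET)  # map back to codec token ids
    return out


def toy_next_token(seq: List[int]) -> int:
    """Stand-in for the LM (not a real model): echoes the condition, then stops."""
    sep_at = seq.index(SEP)
    condition = seq[1:sep_at]
    generated = len(seq) - sep_at - 1
    return condition[generated] if generated < len(condition) else EOS


if __name__ == "__main__":
    prefix = build_prefix("SR", [3, 1, 4])
    print(prefix)
    print(generate(prefix, toy_next_token))
```

In a real system the generated ids would be passed to the codec decoder to reconstruct a waveform; here the toy model simply copies its condition so the control flow can be checked end to end.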

Chengwei Liu, Haoyin Yan, Shaofei Xue, Xiaotao Liang, Xiaofu Chen, Bin Gong, Zheng Xue, Gang Song • 2025

Related benchmarks

Task	Dataset	Result
Speech Separation	Libri2Mix (test)	--	68
Speech Reconstruction	Librispeech (test-clean)	UTMOS 4.08	64
Audio Reconstruction	AudioSet (eval)	Mel Distance 1.1125	63
Speech Reconstruction	Seed-ZH	PESQ 2.88	29
Voice Conversion	VCTK	WER 3.02	21
Speech Reconstruction	Seed EN	PESQ 2.77	12
Packet Loss Concealment	ICASSP PLC-challenge 2022 (test)	PLCMOS Score 4.58	9
Target Speaker Extraction	Libri2Mix Clean (test)	DNSMOS SIG 3.62	9
Audio Reconstruction	MUSDB18 HQ (test)	Mel Loss 0.5035	7
Speech Semantic Editing	Seed-zh (test)	WER 11.275	3

Showing 10 of 12 rows
