Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers

About

Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization. In this paper, we present General-Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications. This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline that achieves high concurrency and throughput. The resulting model family supports efficient multi-scale deployment, including a lightweight 0.3B-parameter variant optimized for edge and resource-constrained environments. Together, these design choices demonstrate that a unified autoregressive architecture can achieve competitive performance across diverse speech tasks while remaining viable for low-latency, practical deployment.
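The instruction-driven task induction described above can be pictured as a single token sequence whose leading tag tells the model which task to perform. The following is a minimal sketch under stated assumptions, not the authors' implementation: the special tokens (`<|tts|>`, `<|asr|>`, `<|vc|>`, `<|bos|>`, `<|sep|>`), the `<|aN|>` rendering of audio codes, and the whitespace text tokenizer are all hypothetical placeholders, since the abstract does not specify the vocabulary or prompt layout.

```python
from typing import List, Optional

# Hypothetical special tokens; the actual GPA vocabulary is not given in the abstract.
TASK_TOKENS = {"tts": "<|tts|>", "asr": "<|asr|>", "vc": "<|vc|>"}
BOS, SEP = "<|bos|>", "<|sep|>"


def audio_tokens(codes: List[int]) -> List[str]:
    """Render discrete audio codec indices as tokens in the shared vocabulary."""
    return [f"<|a{c}|>" for c in codes]


def build_prompt(
    task: str,
    text: Optional[str] = None,
    source_audio: Optional[List[int]] = None,
    reference_audio: Optional[List[int]] = None,
) -> List[str]:
    """Build the prefix that an autoregressive model would continue.

    The same model and the same vocabulary serve all tasks; only the task
    tag and the kind of conditioning content change.
    """
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task}")
    prompt = [BOS, TASK_TOKENS[task]]
    if task == "tts":
        if text is None:
            raise ValueError("tts requires text")
        # Whitespace split stands in for a real text tokenizer (assumption).
        prompt += text.split()
    elif task == "asr":
        if source_audio is None:
            raise ValueError("asr requires source_audio")
        prompt += audio_tokens(source_audio)
    else:  # vc: source speech plus a reference for the target voice
        if source_audio is None or reference_audio is None:
            raise ValueError("vc requires source_audio and reference_audio")
        prompt += audio_tokens(source_audio) + [SEP] + audio_tokens(reference_audio)
    # Everything generated after the final separator is the task output:
    # audio tokens for TTS and VC, text tokens for ASR.
    prompt.append(SEP)
    return prompt
```

In this sketch, `build_prompt("asr", source_audio=[3, 7])` and `build_prompt("tts", text="hello world")` differ only in the task tag and conditioning tokens, which is the sense in which one model can switch tasks without architectural changes.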

Runyuan Cai, Yu Lin, Yiming Wang, Chunlin Fu, Xiaodong Zeng • 2026

Related benchmarks

Task                          Dataset                   Result    Rank
Automatic Speech Recognition  LibriSpeech clean (test)  WER 2.52  833
Text-to-Speech                Seed-zh (test)            CER 0.84  17
Text-to-Speech                Seed-en (test)            WER 1.31  16
Automatic Speech Recognition  AISHELL-1                 CER 1.93  12
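
For readers unfamiliar with the metrics in the table, word error rate (WER) and character error rate (CER) are conventionally computed as the edit distance between a reference and a hypothesis transcript, divided by the reference length. The sketch below shows that standard definition. It is not the evaluation script behind the listed results, it applies no text normalization, and reporting the value as a percentage is an assumption, since the listing does not state units.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference item
                curr[j - 1] + 1,         # insertion of a hypothesis item
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a percentage of reference words."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a percentage of reference characters (spaces ignored)."""
    ref = list(reference.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return 100.0 * edit_distance(ref, list(hypothesis.replace(" ", ""))) / len(ref)
```

CER is the usual choice for Chinese test sets such as AISHELL-1, where word boundaries are not marked by spaces, while WER is typical for English sets such as LibriSpeech.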
