
GLM-TTS Technical Report

About

This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at https://github.com/zai-org/GLM-TTS. Real-time speech synthesis demos are available on Z.ai (audio.z.ai) and in the Zhipu Qingyan app and web client (chatglm.cn).

Jiayan Cui, Zhihan Yang, Naihan Li, Jiankun Tian, Xingyu Ma, Yi Zhang, Guangyu Chen, Runxuan Yang, Yuqing Cheng, Yizhi Zhou, Guochen Yu, Xiaotao Gu, Jie Tang • 2025

Related benchmarks

Task            Dataset                        Metric  Result  Rank
Text-to-Speech  Seed-TTS (eval)                WER     1.91    39
Text-to-Speech  Chinese standard (test)        CER     0.89    21
Text-to-Speech  English (test)                 WER     0.0191  21
Text-to-Speech  Seed-TTS Seed-ZH (Evaluation)  CER     0.89    16
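
The WER and CER figures above are error rates between a reference text and a transcript of the synthesized speech. This page does not say how they were computed, so the following is only a minimal sketch of the standard definition (edit distance divided by reference length). The function names are hypothetical, and the text normalization and the ASR model used to transcribe the audio are out of scope here. The functions return a fraction; the page does not state whether each reported value is a fraction or a percentage.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction, using whitespace tokenization (an assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a fraction, ignoring whitespace (an assumption)."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

CER is the usual choice for Chinese because the text has no whitespace word boundaries, which is consistent with the table pairing CER with the Chinese test sets and WER with the English ones.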
