GLM-TTS Technical Report

About

This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at https://github.com/zai-org/GLM-TTS. Real-time speech synthesis demos are available on Z.ai (audio.z.ai) and the Zhipu Qingyan app/web (chatglm.cn).

Jiayan Cui, Zhihan Yang, Naihan Li, Jiankun Tian, Xingyu Ma, Yi Zhang, Guangyu Chen, Runxuan Yang, Yuqing Cheng, Yizhi Zhou, Guochen Yu, Xiaotao Gu, Jie Tang • 2025

Related benchmarks

Task	Dataset	Result
Text-to-Speech	Seed-TTS (eval)	WER 1.91	39
Text-to-Speech	Chinese standard (test)	CER 0.89	21
Text-to-Speech	English (test)	WER 0.0191	21
Text-to-Speech	Seed-TTS Seed-ZH (Evaluation)	CER 0.89	16
Monologue Text-to-Speech	SwanBench-Speech Expressive Challenge	Timbre 0.94	11
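
The WER and CER figures in the table are error rates: the edit distance between a reference text and a transcript of the synthesized audio, divided by the reference length. The following is a minimal sketch of that calculation, not the benchmarks' own evaluation code. The function names are hypothetical, and real evaluations typically obtain the hypothesis from an ASR model and normalize the text (case, punctuation) first, which this sketch omits. Depending on the source, the ratio may be reported as a fraction or multiplied by 100.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are characters, spaces ignored."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` counts one deleted word against six reference words. CER applies the same calculation per character, which is why it is the usual choice for Chinese text, where words are not separated by spaces.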
