GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot
About
We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low-bitrate (175 bps), single-codebook speech tokenizer with a 12.5 Hz frame rate, derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. To efficiently transfer knowledge from the text modality to the speech modality, we synthesize speech-text interleaved data from existing text pre-training corpora using a text-to-token model. We continue pre-training from the pre-trained text language model GLM-4-9B with a combination of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling up to 1 trillion tokens and achieving state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality. The open models can be accessed through https://github.com/THUDM/GLM-4-Voice and https://huggingface.co/THUDM/glm-4-voice-9b.
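The tokenizer figures quoted above (175 bps, 12.5 Hz, single codebook) can be related by simple arithmetic. The sketch below assumes each speech token is coded with a fixed number of bits, so that bitrate = frame rate × number of codebooks × log2(codebook size). Under that assumption the stated numbers imply 14 bits per token, i.e. a codebook of 2^14 entries. The function names are illustrative and are not part of the GLM-4-Voice codebase.

```python
import math


def bits_per_token(bitrate_bps: float, frame_rate_hz: float) -> float:
    """Bits carried by each token, assuming one token per frame."""
    return bitrate_bps / frame_rate_hz


def implied_codebook_size(bitrate_bps: float, frame_rate_hz: float) -> int:
    """Codebook size implied by a fixed-length code for a single codebook."""
    return round(2 ** bits_per_token(bitrate_bps, frame_rate_hz))


def bitrate_bps(frame_rate_hz: float, codebook_size: int, num_codebooks: int = 1) -> float:
    """Bitrate of a tokenizer emitting `num_codebooks` tokens per frame."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


def tokens_for_audio(seconds: float, frame_rate_hz: float) -> int:
    """Number of speech tokens needed to cover `seconds` of audio."""
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    # Values taken from the abstract: 175 bps at a 12.5 Hz frame rate.
    print(bits_per_token(175, 12.5))
    print(implied_codebook_size(175, 12.5))
    print(tokens_for_audio(10, 12.5))
```

At 12.5 tokens per second, ten seconds of speech maps to 125 tokens, which is part of why a low frame rate keeps speech sequences short enough to model alongside text.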
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 2 | 1207 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 7.66 | 1206 |
| Emotion Recognition | IEMOCAP | Accuracy 22.38 | 151 |
| Text-to-Speech | Seed-TTS en (test) | WER 2.91 | 121 |
| Automatic Speech Recognition | AISHELL-1 (test) | CER 2.46 | 105 |
| Text-to-Speech | Seed-TTS zh (test) | WER 2.1 | 87 |
| Question Answering | TQA | Accuracy 45.7 | 80 |
| Question Answering | WebQA | -- | 64 |
| Automatic Speech Recognition | AISHELL-1 | CER 1.81 | 55 |
| Audio Understanding | MMSU | Perception Score 11.04 | 37 |