GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

About

We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low bitrate (175bps), single-codebook speech tokenizer with 12.5Hz frame rate derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. To efficiently transfer knowledge from text to speech modalities, we synthesize speech-text interleaved data from existing text pre-training corpora using a text-to-token model. We continue pre-training from the pre-trained text language model GLM-4-9B with a combination of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling up to 1 trillion tokens, achieving state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality. The open models can be accessed through https://github.com/THUDM/GLM-4-Voice and https://huggingface.co/THUDM/glm-4-voice-9b.
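The tokenizer figures quoted above (175bps, 12.5Hz, single codebook) can be related with a little arithmetic. The sketch below assumes that bitrate equals frame rate times bits per token, and that a single codebook of size N costs log2(N) bits per token. The implied codebook size is derived from those assumptions and is not stated in the abstract; the function names are illustrative only.

```python
import math


def bits_per_token(bitrate_bps: float, frame_rate_hz: float) -> float:
    """Bits carried by each token, assuming bitrate = frame rate * bits per token."""
    return bitrate_bps / frame_rate_hz


def implied_codebook_size(bitrate_bps: float, frame_rate_hz: float) -> int:
    """Codebook size implied by the bitrate, assuming one codebook and log2(N) bits per token."""
    return round(2 ** bits_per_token(bitrate_bps, frame_rate_hz))


def tokens_for_duration(seconds: float, frame_rate_hz: float = 12.5) -> int:
    """Number of speech tokens needed to cover a clip of the given length."""
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    # Figures taken from the abstract: 175 bps at a 12.5 Hz frame rate.
    print(bits_per_token(175, 12.5))         # 14.0 bits per token
    print(implied_codebook_size(175, 12.5))  # 16384 entries (derived, not quoted)
    print(tokens_for_duration(10))           # 125 tokens for 10 seconds of speech
```

Under these assumptions, ten seconds of speech occupy only 125 tokens, which illustrates why a low frame rate keeps spoken sequences short enough to be modeled alongside text.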

Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, Jie Tang • 2024

Related benchmarks

Task                                   Dataset                   Result          Rank
Automatic Speech Recognition           LibriSpeech clean (test)  WER 2           1156
Automatic Speech Recognition           LibriSpeech (test-other)  WER 7.66        1151
Emotion Recognition                    IEMOCAP                   Accuracy 22.38  115
Automatic Speech Recognition           AISHELL-1 (test)          CER 2.46        97
Text-to-Speech                         Seed-TTS en (test)        WER 2.91        90
Question Answering                     TQA                       Accuracy 45.7   74
Text-to-Speech                         Seed-TTS zh (test)        WER 2.1         65
Automatic Speech Recognition           AISHELL-1                 CER 1.81        50
Question Answering                     WebQA                     --              40
Audio-Grounded Character Role-playing  AudioRole-Demo            AP Rank 1.39    32
Showing 10 of 97 rows