GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

About

We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low bitrate (175 bps), single-codebook speech tokenizer with a 12.5 Hz frame rate, derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. To efficiently transfer knowledge from the text modality to the speech modality, we synthesize speech-text interleaved data from existing text pre-training corpora using a text-to-token model. We continue pre-training from the pre-trained text language model GLM-4-9B with a combination of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling up to 1 trillion tokens and achieving state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality. The open models can be accessed through https://github.com/THUDM/GLM-4-Voice and https://huggingface.co/THUDM/glm-4-voice-9b.
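The tokenizer figures quoted above can be sanity-checked with a little arithmetic. The sketch below assumes that the single-codebook tokenizer emits exactly one token per frame and that every token costs the same number of bits; under those assumptions, 175 bps at 12.5 Hz works out to 14 bits per token, which would correspond to a codebook of 2^14 entries. The function names are illustrative only and do not come from the GLM-4-Voice code base.

```python
def bits_per_token(bitrate_bps: float, frame_rate_hz: float) -> float:
    """Bits carried by each token, assuming one token per frame."""
    return bitrate_bps / frame_rate_hz


def implied_codebook_size(bitrate_bps: float, frame_rate_hz: float) -> int:
    """Codebook size implied by a fixed-length code (assumption: uniform coding)."""
    return 2 ** round(bits_per_token(bitrate_bps, frame_rate_hz))


def tokens_for_duration(seconds: float, frame_rate_hz: float) -> float:
    """Number of speech tokens needed to represent `seconds` of audio."""
    return seconds * frame_rate_hz


# Figures taken from the abstract: 175 bps bitrate, 12.5 Hz frame rate.
BITRATE_BPS = 175
FRAME_RATE_HZ = 12.5

print(bits_per_token(BITRATE_BPS, FRAME_RATE_HZ))         # 14.0
print(implied_codebook_size(BITRATE_BPS, FRAME_RATE_HZ))  # 16384
print(tokens_for_duration(60, FRAME_RATE_HZ))             # 750.0 tokens per minute
```

At this frame rate, one minute of speech occupies 750 tokens, which gives a rough sense of how much audio fits in a language model's context window.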

Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, Jie Tang • 2024

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech clean (test)	WER 2	1207
Automatic Speech Recognition	LibriSpeech (test-other)	WER 7.66	1206
Emotion Recognition	IEMOCAP	Accuracy 22.38	151
Text-to-Speech	Seed-TTS en (test)	WER 2.91	121
Automatic Speech Recognition	AISHELL-1 (test)	CER 2.46	105
Text-to-Speech	Seed-TTS zh (test)	WER 2.1	87
Question Answering	TQA	Accuracy 45.7	80
Question Answering	WebQA	--	64
Automatic Speech Recognition	AISHELL-1	CER 1.81	55
Audio Understanding	MMSU	Perception Score 11.04	37

Showing 10 of 127 rows
