SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

About

Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT. However, current speech-language models typically adopt the cascade paradigm, preventing inter-modal knowledge transfer. In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-model content. With discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. Additionally, we employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The experimental results demonstrate that SpeechGPT has an impressive capacity to follow multi-modal human instructions and highlight the potential of handling multiple modalities with one model. Demos are shown in https://0nutation.github.io/SpeechGPT.github.io/.

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, Xipeng Qiu• 2023

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech clean (test)	WER18.9	1207
Automatic Speech Recognition	LibriSpeech Other	WER16.7	123
Automatic Speech Recognition	LibriSpeech Clean	WER11	107
Text-to-Speech	LibriSpeech clean (test)	WER24.6	88
Automatic Speech Recognition	VoxPopuli	WER18.2	38
Automatic Speech Recognition	AISHELL-1	WER111.8	31
Text-to-Speech	LibriTTS clean (test)	WER29.1	30
Speech-to-Text Question-Answering	TriviaQA	Accuracy8.2	26
Speech-to-Text Question-Answering	LlamaQ	Accuracy5.4	26
Speech-to-Text Question-Answering	WebQ	Accuracy5.5	26

Showing 10 of 56 rows

Other info

Follow for update

@wizwand_team Discord