SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

About

Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT. However, current speech-language models typically adopt the cascade paradigm, preventing inter-modal knowledge transfer. In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. With discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. Additionally, we employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The experimental results demonstrate that SpeechGPT has an impressive capacity to follow multi-modal human instructions and highlight the potential of handling multiple modalities with one model. Demos are available at https://0nutation.github.io/SpeechGPT.github.io/.

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, Xipeng Qiu • 2023

Related benchmarks

Task                               Dataset                 Result          Rank
Automatic Speech Recognition       LibriSpeech Other       WER 16.7        75
Automatic Speech Recognition       LibriSpeech Clean       WER 11          57
Automatic Speech Recognition       VoxPopuli               WER 18.2        27
Automatic Speech Recognition       LS Clean                WER 11          25
Automatic Speech Recognition       VoxPopuli 1.0 (test)    Avg WER 18.2    14
Text-to-Speech                     LibriSpeech Clean       WER 14.1        12
Automatic Speech Recognition       Common Voice en 15      WER 19.4        10
Speech-to-Text Question-Answering  TriviaQA                Accuracy 8.2    9
Text-to-Speech                     VoxPopuli en V1.0       WER (%) 21.3    9
Text-to-Speech                     Common Voice en 15      WER 23.2        9
Showing 10 of 29 rows
