
Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

About

Rapidly developing large language models (LLMs) have enabled a wide range of intelligent applications. In particular, GPT-4o's duplex speech interaction has given users an impressive experience. Researchers have recently proposed several multimodal LLMs in this direction that support user-agent speech-to-speech conversation. This paper proposes a novel speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be connected to a textual LLM while the LLM's parameters stay frozen throughout training. We design a three-stage training strategy for modeling both speech input and speech output, which lets Freeze-Omni acquire speech-to-speech conversation ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A samples on 8 GPUs. Moreover, Freeze-Omni's intelligence in the speech modality remains at the same level as that of its backbone LLM in the text modality, while achieving a low-latency end-to-end spoken response. In addition, we design a method that achieves duplex dialogue ability through multi-task training, giving Freeze-Omni a more natural dialogue style between users and agents. In summary, Freeze-Omni shows great potential for speech-to-speech dialogue based on a multimodal LLM with a frozen backbone, avoiding the catastrophic forgetting caused by limited data and training resources.

Xiong Wang, Yangze Li, Chaoyou Fu, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, Long Ma • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 3.82 | 1207 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 9.79 | 1206 |
| Automatic Speech Recognition | WenetSpeech Meeting (test) | -- | -- | 78 |
| Speech-to-Speech Question-Answering | Llama Questions | Accuracy | 58.67 | 27 |
| Speech-to-Text Question-Answering | WebQ | Accuracy | 54.1 | 26 |
| Speech-to-Text Question-Answering | LlamaQ | Accuracy | 78.67 | 26 |
| Automatic Speech Recognition | AISHELL (test) | CER | 2.48 | 26 |
| Speech-to-Text Question-Answering | TriviaQA | Accuracy | 51.15 | 26 |
| Speech-to-Speech Question-Answering | WebQ | Accuracy | 29.8 | 25 |
| Speech-to-Speech Question-Answering | TriviaQA | Accuracy | 32.36 | 22 |
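
For readers unfamiliar with the WER and CER metrics in the benchmark rows above, the following is a minimal sketch of the standard definition: the Levenshtein edit distance between reference and hypothesis tokens, divided by the reference length and expressed as a percentage. This page does not say how the listed scores were computed, so the function names and the absence of any text normalization here are assumptions for illustration only, not the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, 1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are characters, spaces ignored (an assumption)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))
```

For example, `wer("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words. CER applies the same computation at the character level, which is the usual choice for Mandarin test sets such as AISHELL.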
