Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
About
Rapidly developing large language models (LLMs) have enabled a wide range of intelligent applications. In particular, GPT-4o's excellent duplex speech interaction ability has given users an impressive experience. Researchers have recently proposed several multimodal LLMs in this direction that can achieve user-agent speech-to-speech conversations. This paper proposes a novel speech-text multimodal LLM architecture called Freeze-Omni. Our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while keeping the LLM's parameters frozen throughout the training process. We design a three-stage training strategy for modeling both the speech input and output, enabling Freeze-Omni to obtain speech-to-speech conversation ability using text-speech paired data (such as ASR and TTS data) and only 60,000 multi-round text Q&A samples on 8 GPUs. Moreover, we can effectively ensure that the intelligence of Freeze-Omni in the speech modality is at the same level as that of its backbone LLM in the text modality, while achieving low-latency end-to-end spoken responses. In addition, we design a method to achieve duplex dialogue ability through multi-task training, giving Freeze-Omni a more natural dialogue style between users and agents. In summary, Freeze-Omni holds great potential for speech-to-speech dialogue based on a multimodal LLM with a frozen backbone, avoiding the catastrophic forgetting caused by limited data and training resources.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| General Audio Understanding | VoiceBench | AlpacaEval Score: 4.03 | 16 |
| Knowledge Understanding | UltraEvalAudio full-duplex variant | Llama Q.74 | 8 |
| Interruption Handling | Full-Duplex-Bench | GPT-4o Score: 3.615 | 6 |
| Emotional Reasoning | HumDial Challenge Track 1 Task 2-zh (dev) | LLM Score: 3.63 | 6 |
| Emotional Trajectory Detection | HumDial Challenge Track 1 Task 1-en (dev) | LLM Score (0-5): 2.58 | 6 |
| Empathetic Response Generation | HumDial Challenge Track 1 Task 3-zh (dev) | LLM Score (0-5): 4.02 | 6 |
| Empathetic Response Generation | HumDial Challenge Track 1 Task 3-en (dev) | LLM Score (0-5): 3.66 | 6 |
| Audio Question Answering | TELEVAL AQA-zh (dev) | TELEVAL Score: 21.23 | 6 |
| Audio Question Answering | TELEVAL AQA-en (dev) | TELEVAL Score: 45.29 | 6 |
| Emotional Reasoning | HumDial Challenge Track 1 Task 2-en (dev) | LLM Score: 2.79 | 6 |