EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
About
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models have limited or no vision-understanding capabilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech abilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we find, surprisingly, that omni-modal alignment can further enhance vision-language and speech abilities compared with the bi-modal aligned counterparts. Moreover, a lightweight style module is introduced for flexible control of speech style, including emotion and pitch. For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while also supporting omni-modal spoken dialogue with vivid emotions.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 81.4 | 1117 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 2.9 | 833 |
| Multimodal Evaluation | MME | Score | 2.40e+3 | 557 |
| OCR Evaluation | OCRBench | Score | 843 | 296 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 59.7 | 266 |
| Science Question Answering | ScienceQA IMG | Accuracy | 98.2 | 256 |
| Visual Question Answering | ChartQA | Accuracy | 88.7 | 239 |
| Multimodal Model Evaluation | MMBench | Accuracy | 86.4 | 180 |
| Visual Question Answering | AI2D | Accuracy | 85.8 | 174 |
| Multimodal Evaluation | MM-Vet | Accuracy | 64.8 | 122 |