VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

About

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to fundamental differences between the modalities. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains an LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model has both strong visual and speech capabilities, enabling near real-time vision and speech interaction. Code has been released at https://github.com/VITA-MLLM/VITA.

Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He • 2025

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech (test-other)	WER 18.4	1447
Automatic Speech Recognition	LibriSpeech clean (test)	WER 8.1	1410
Video Understanding	MVBench	Accuracy 55.5	635
Automatic Speech Recognition	LibriSpeech (dev-other)	WER 16.6	535
Visual Mathematical Reasoning	MathVista	Accuracy 66.2	448
Video Question Answering	ActivityNet-QA	Accuracy 59.6	438
Automatic Speech Recognition	LibriSpeech (dev-clean)	WER (%) 7.6	376
Video Understanding	VideoMME	Score (Overall) 56.1	369
Streaming Video Understanding	StreamingBench	Overall 70.88	308
Visual Mathematical Reasoning	MathVision	Accuracy 19.5	298

Showing 10 of 67 rows
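
For readers unfamiliar with the WER metric in the speech recognition rows, the following is a minimal sketch of how word error rate is commonly computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, expressed as a percentage. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this page does not state which text normalization or scoring tool produced the numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent, using plain whitespace tokenization (an assumption)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion of a reference word
                curr[j - 1] + 1,                   # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Lower values are better, and because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.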
