Speech-Omni-Lite: Portable Speech Interfaces for Vision-Language Models

About

While large-scale omni-models have demonstrated impressive capabilities across modalities, their strong performance relies heavily on massive multimodal data and incurs substantial computational costs. This work introduces Speech-Omni-Lite, a cost-efficient framework for extending pre-trained Vision-Language (VL) backbones with speech understanding and generation capabilities while fully preserving the backbones' vision-language performance. Specifically, the VL backbone is kept fully frozen and is equipped with two lightweight, trainable, plug-and-play modules: a speech projector and a speech token generator. To mitigate the scarcity of spoken QA corpora, a low-cost data construction strategy is proposed to generate Question-Text Answer-Text-Speech (QTATS) data from existing ASR speech-text pairs, facilitating effective training for speech generation. Experimental results show that, even with only thousands of hours of speech training data, Speech-Omni-Lite achieves spoken QA performance comparable to that of omni-models trained on millions of hours of speech data. Furthermore, the learned speech modules exhibit strong transferability across VL backbones.

Dehua Tao, Xuan Luo, Daxin Tan, Kai Chen, Lanqing Hong, Jing Li, Ruifeng Xu, Xiao Chen • 2026

Related benchmarks

Task                                 Dataset                          Result            Rank
Automatic Speech Recognition         LibriSpeech clean (test)         WER 3.58          1156
Automatic Speech Recognition         LibriSpeech (test-other)         WER 7.75          1151
Automatic Speech Recognition         AISHELL-1 (test)                 CER 5.46          97
Automatic Speech Recognition         WenetSpeech Meeting (test)       CER 18.18         78
Automatic Speech Recognition         WenetSpeech Net (test)           CER 16.5          57
Speech-to-Text Question-Answering    LlamaQ                           Accuracy 86       23
Speech-to-Text Question-Answering    TriviaQA                         Accuracy 71.8     23
Speech-to-Text Question-Answering    WebQ                             Accuracy 66.6     23
Speech-to-Speech Question-Answering  Llama Questions                  Accuracy 73.33    15
Speech-to-Speech Question-Answering  TriviaQA                         Accuracy 50.4     13
Showing 10 of 17 rows
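
The ASR rows above report word error rate (WER) and character error rate (CER). This page does not say how the scores were computed, so the following is only a minimal sketch of the standard definitions: edit distance between reference and hypothesis, divided by the reference length, expressed as a percentage. The function names and example sentences are illustrative assumptions, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are characters, spaces ignored."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


# One substitution (sat -> sit) and one deletion (the) over six reference words.
example_wer = wer("the cat sat on the mat", "the cat sit on mat")
# One substituted character out of four.
example_cer = cer("你好世界", "你好时界")
```

Reported scores usually also depend on text normalization (casing, punctuation, number formatting) applied before scoring, which this sketch leaves out.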
