Speech-Omni-Lite: Portable Speech Interfaces for Vision-Language Models
About
While large-scale omni-models have demonstrated impressive capabilities across modalities, their strong performance relies heavily on massive multimodal data and comes at substantial computational cost. This work introduces Speech-Omni-Lite, a cost-efficient framework for extending pre-trained Vision-Language (VL) backbones with speech understanding and generation capabilities while fully preserving the backbones' vision-language performance. Specifically, the VL backbone is kept fully frozen and is equipped with two lightweight, trainable plug-and-play modules: a speech projector and a speech token generator. To mitigate the scarcity of spoken QA corpora, a low-cost data construction strategy is proposed that generates Question-Text Answer-Text-Speech (QTATS) data from existing ASR speech-text pairs, enabling effective training for speech generation. Experimental results show that, with only thousands of hours of speech training data, Speech-Omni-Lite achieves spoken QA performance comparable to omni-models trained on millions of hours of speech data. Furthermore, the learned speech modules transfer well across VL backbones.
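The abstract does not spell out how QTATS data is built, so the following is only a minimal sketch of one plausible reading: each ASR pair supplies the answer text and answer speech, and a question text is generated for it. The names `AsrPair`, `QtatsSample`, `build_qtats`, and the `make_question` callable (standing in for whatever text model writes the question) are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class AsrPair:
    """One existing ASR example: an audio file and its transcript."""
    audio_path: str
    transcript: str


@dataclass(frozen=True)
class QtatsSample:
    """Question-Text, Answer-Text, Answer-Speech triple (hypothetical layout)."""
    question_text: str
    answer_text: str
    answer_audio_path: str


def build_qtats(
    pairs: Iterable[AsrPair],
    make_question: Callable[[str], str],
) -> Iterator[QtatsSample]:
    """Turn ASR pairs into QTATS samples.

    Assumption: the transcript is reused as the answer text and the original
    recording as the answer speech; `make_question` produces a question whose
    answer is that transcript.
    """
    for pair in pairs:
        transcript = pair.transcript.strip()
        if not transcript:
            continue  # nothing to use as an answer
        question = make_question(transcript).strip()
        if not question:
            continue  # generator produced no usable question
        yield QtatsSample(question, transcript, pair.audio_path)


def stub_question(answer: str) -> str:
    """Placeholder generator; a real pipeline would call a text model here."""
    return f"What was said in this recording? (expected answer: {len(answer.split())} words)"


samples = list(
    build_qtats(
        [AsrPair("a.wav", "the cat sat on the mat"), AsrPair("b.wav", "   ")],
        stub_question,
    )
)
```

Because the answer speech is taken from the existing recording, a construction of this kind would need no speech synthesis, which is consistent with the "low-cost" description, though the paper's actual procedure may differ.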
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 3.58 | 1156 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 7.75 | 1151 |
| Automatic Speech Recognition | AISHELL-1 (test) | CER | 5.46 | 97 |
| Automatic Speech Recognition | WenetSpeech Meeting (test) | CER | 18.18 | 78 |
| Automatic Speech Recognition | WenetSpeech Net (test) | CER | 16.5 | 57 |
| Speech-to-Text Question-Answering | LlamaQ | Accuracy | 86 | 23 |
| Speech-to-Text Question-Answering | TriviaQA | Accuracy | 71.8 | 23 |
| Speech-to-Text Question-Answering | WebQ | Accuracy | 66.6 | 23 |
| Speech-to-Speech Question-Answering | Llama Questions | Accuracy | 73.33 | 15 |
| Speech-to-Speech Question-Answering | TriviaQA | Accuracy | 50.4 | 13 |
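For readers unfamiliar with the ASR metrics in the table, here is a minimal sketch of the standard definitions: word error rate (WER) is the word-level edit distance between reference and hypothesis divided by the number of reference words, and character error rate (CER) is the same at character level. The function names are illustrative, and the text normalization used for the reported results is not stated in the source, so this sketch applies none beyond whitespace handling.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance to empty hypothesis prefix: i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, using whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring whitespace."""
    ref_chars = [c for c in reference if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` counts one deleted word out of six reference words. CER is the usual choice for Mandarin test sets such as AISHELL-1 and WenetSpeech, where word boundaries are not marked by spaces.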