On decoder-only architecture for speech-to-text and large language model integration

About

Large language models (LLMs) have achieved remarkable success in natural language processing, enabling better human-computer interaction through natural language. However, the seamless integration of speech signals into LLMs has not been well explored, and the "decoder-only" architecture has not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification (CTC) and a simple audio encoder to map the compressed acoustic features to the continuous semantic space of the LLM. We further probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
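The abstract does not spell out how CTC is used to compress the acoustic features, so the following is only a minimal sketch of one common reading: take the frame-level CTC argmax, drop blank frames, average each run of consecutive frames that share a label, and then project the shortened sequence into the LLM embedding dimension. The function names `ctc_compress` and `project_to_llm_space`, the blank index 0, and the single linear projection standing in for the audio encoder are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def ctc_compress(features, ctc_logits, blank_id=0):
    """Shorten a frame sequence using greedy CTC labels (hypothetical helper).

    features:   (T, D) array of frame-level acoustic features.
    ctc_logits: (T, V) array of per-frame CTC scores over V labels.
    Frames predicted as blank are dropped; each run of consecutive frames
    with the same non-blank label is replaced by the mean of that run.
    """
    labels = ctc_logits.argmax(axis=-1)
    num_frames = len(labels)
    compressed = []
    start = 0
    while start < num_frames:
        # Extend the run while the next frame has the same label.
        end = start
        while end + 1 < num_frames and labels[end + 1] == labels[start]:
            end += 1
        if labels[start] != blank_id:
            compressed.append(features[start:end + 1].mean(axis=0))
        start = end + 1
    if not compressed:
        # Every frame was blank: return an empty (0, D) sequence.
        return np.zeros((0, features.shape[1]), dtype=features.dtype)
    return np.stack(compressed)


def project_to_llm_space(compressed, weight, bias):
    """Linear map from acoustic dimension D to LLM embedding dimension H.

    A stand-in (assumption) for the audio encoder described in the abstract.
    """
    return compressed @ weight + bias


# Toy example: 6 frames of 2-dim features with greedy labels [0, 1, 1, 0, 2, 2].
features = np.arange(12, dtype=np.float32).reshape(6, 2)
ctc_logits = np.eye(3, dtype=np.float32)[[0, 1, 1, 0, 2, 2]]
short = ctc_compress(features, ctc_logits)           # 2 frames remain
rng = np.random.default_rng(0)
weight = rng.standard_normal((2, 4)).astype(np.float32)
bias = np.zeros(4, dtype=np.float32)
llm_inputs = project_to_llm_space(short, weight, bias)  # shape (2, 4)
```

In this toy case six input frames collapse to two, which illustrates why such compression brings the acoustic sequence length closer to that of the text tokens the LLM normally consumes.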

Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu • 2023

Related benchmarks

Task	Dataset	Metric	Result	Rank
Speech Recognition	In-house dataset	CER	0.031	19
Speech-to-text Translation	CoVoST2 zh-en	BLEU	12.3	12
Speech-to-text Translation	CoVoST2 fr-en	BLEU	25.2	8
Speech-to-text Translation	CoVoST2 de-en	BLEU	27.1	3
Speech-to-text Translation	CoVoST2 es-en	BLEU	27.9	2
Speech-to-text Translation	CoVoST2 it-en	BLEU	25.9	2

