
On decoder-only architecture for speech-to-text and large language model integration

About

Large language models (LLMs) have achieved remarkable success in natural language processing, enabling more natural human-computer interaction. However, the seamless integration of speech signals into LLMs has not been well explored, and the "decoder-only" architecture remains understudied for speech processing tasks. In this research, we introduce Speech-LLaMA, a novel approach that effectively incorporates acoustic information into text-based large language models. Our method leverages Connectionist Temporal Classification (CTC) and a simple audio encoder to map compressed acoustic features into the continuous semantic space of the LLM. In addition, we probe the decoder-only architecture for speech-to-text tasks by training a smaller-scale, randomly initialized Speech-LLaMA model from speech-text paired data alone. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines, highlighting the potential advantages of decoder-only models for speech-to-text conversion.
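The core idea described above — using CTC to compress the acoustic frame sequence before projecting it into the LLM's embedding space — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the compression rule (keep one frame per non-blank, non-repeated CTC prediction), the dimensions, and the single linear projection `W` are all assumptions made for the example.

```python
import numpy as np

def ctc_compress(frames, ctc_logits, blank_id=0):
    """Keep one frame per non-blank, non-repeated greedy CTC prediction
    (an assumed compression rule for illustration)."""
    ids = ctc_logits.argmax(axis=-1)
    keep, prev = [], blank_id
    for t, i in enumerate(ids):
        if i != blank_id and i != prev:
            keep.append(t)
        prev = i
    return frames[keep]

rng = np.random.default_rng(0)
T, d_audio, d_llm, vocab = 50, 80, 512, 30   # hypothetical sizes
frames = rng.standard_normal((T, d_audio))   # acoustic encoder outputs
logits = rng.standard_normal((T, vocab))     # per-frame CTC logits

# Compress, then map into the LLM's continuous embedding space.
compressed = ctc_compress(frames, logits)
W = rng.standard_normal((d_audio, d_llm)) / np.sqrt(d_audio)
audio_embeds = compressed @ W
# audio_embeds would be prepended to the text token embeddings,
# so the decoder-only LLM attends to speech and text in one sequence.
```

The compression step matters because raw acoustic frames are far longer than their transcript; dropping blank and repeated CTC frames shortens the sequence the LLM must attend over.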

Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speech Recognition | In-house dataset | CER | 0.031 | 19 |
| Speech-to-text Translation | CoVoST2 fr-en | BLEU | 25.2 | 8 |
| Speech-to-text Translation | CoVoST2 de-en | BLEU | 27.1 | 3 |
| Speech-to-text Translation | CoVoST2 zh-en | BLEU | 12.3 | 2 |
| Speech-to-text Translation | CoVoST2 es-en | BLEU | 27.9 | 2 |
| Speech-to-text Translation | CoVoST2 it-en | BLEU | 25.9 | 2 |
