Distilling an End-to-End Voice Assistant Without Instruction Training Data

About

Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models "forgetting" capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using >100x less training compute.
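The abstract does not state the exact training objective, so the following is only a minimal sketch of the general idea of using a text-only LLM's response as a self-supervised target. It assumes the teacher (the text-only LLM reading the transcript) and the student (the speech model hearing the audio) each produce per-token logits over the same vocabulary, and it uses a plain KL divergence as the distillation loss. The names `softmax`, `kl_divergence`, and `distillation_loss` are hypothetical and are not taken from the DiVA code.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over response token positions.

    teacher_logits: per-position logits from the text-only LLM on the transcript
                    (assumed, illustrative input).
    student_logits: per-position logits from the speech model on the audio
                    (assumed, illustrative input).
    No annotated response is needed: the teacher's own output is the target.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same positions")
    if not teacher_logits:
        raise ValueError("at least one token position is required")
    per_position = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_position) / len(per_position)
```

In this sketch the loss is zero when the speech model reproduces the text model's distribution exactly and grows as the two diverge, which is what lets the transcript-conditioned response stand in for labeled instruction data.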

William Held, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, Diyi Yang • 2024

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	HellaSwag	Accuracy 67.8	1896
Question Answering	ARC Challenge	Accuracy 81.7	906
Physical Commonsense Reasoning	PIQA	Accuracy 70	696
Common Sense Reasoning	PIQA	Accuracy 80.8	100
Story completion	StoryCloze	Accuracy 68.6	80
Commonsense Reasoning	StoryCloze	Accuracy 80.9	34
Science Question Answering	ARC-C	Accuracy 45.9	32
OpenBook Question Answering	OBQA	Accuracy 0.409	32
Multi-task Language Understanding	MMSU	Accuracy 60.9	23
General Audio Understanding	VoiceBench	AlpacaEval Score 3.67	19

Showing 10 of 22 rows
