Joint Audio and Speech Understanding
About
Humans are surrounded by audio signals that include both speech and non-speech sounds. Recognizing and understanding speech and non-speech audio events, together with a deep comprehension of the relationship between them, are fundamental cognitive capabilities. We build, for the first time, a machine learning model, called LTU-AS, with conceptually similar universal audio perception and advanced reasoning abilities. Specifically, by integrating Whisper as a perception module and LLaMA as a reasoning module, LTU-AS can simultaneously recognize and jointly understand spoken text, speech paralinguistics, and non-speech audio events, that is, almost everything perceivable from audio signals.
Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, James Glass • 2023
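The perception-plus-reasoning design described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the encoder, the embedding widths (`AUDIO_DIM`, `LLM_DIM`), and the frame count are stand-ins, and the real system uses a trained Whisper encoder and LLaMA rather than the toy components here. The key idea it shows is that audio is encoded into continuous embeddings, projected into the LLM's token-embedding space, and prepended to the text-prompt embeddings so the reasoning module can attend to both.

```python
# Hypothetical sketch of a Whisper-style perception module feeding an
# LLM reasoning module; all shapes and components are assumptions.
import numpy as np

AUDIO_DIM = 768   # assumed audio-encoder output width
LLM_DIM = 4096    # assumed LLM hidden width

rng = np.random.default_rng(0)

def whisper_like_encoder(waveform: np.ndarray, frames: int = 25) -> np.ndarray:
    """Stand-in perception module: waveform -> (frames, AUDIO_DIM) embeddings.
    A real encoder would compute log-mel features and run a transformer;
    here we just pool the waveform into fixed-size frame embeddings."""
    chunks = np.array_split(waveform, frames)
    return np.stack([np.full(AUDIO_DIM, c.mean()) for c in chunks])

# Trainable linear projection from audio-embedding space into LLM token space.
W_proj = rng.standard_normal((AUDIO_DIM, LLM_DIM)) / np.sqrt(AUDIO_DIM)

def build_llm_input(waveform: np.ndarray, prompt_embeds: np.ndarray) -> np.ndarray:
    """Prepend projected audio embeddings to the text-prompt embeddings."""
    audio_embeds = whisper_like_encoder(waveform) @ W_proj   # (frames, LLM_DIM)
    return np.concatenate([audio_embeds, prompt_embeds], axis=0)

waveform = rng.standard_normal(16000)        # 1 s of fake 16 kHz audio
prompt = rng.standard_normal((8, LLM_DIM))   # 8 fake prompt-token embeddings
llm_input = build_llm_input(waveform, prompt)
print(llm_input.shape)  # (33, 4096): 25 audio frames + 8 prompt tokens
```

Because the audio enters as embeddings rather than as transcribed text, the reasoning module can use paralinguistic and non-speech information that a transcript would discard.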
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Summarization | iEEG clinical dataset (Foreground) | ROUGE-L 40.2 | 14 |
| Description | iEEG clinical dataset (Background) | Avg Score (G, P, T) 52.2 | 14 |
| Summarization | iEEG clinical dataset (Background) | ROUGE-L 27.9 | 14 |
| Description | iEEG clinical dataset (Foreground) | Avg Score (G, P, T) 48.9 | 14 |
| Free Q&A | iEEG clinical dataset (Foreground) | ROUGE-L 39 | 14 |
| Free Q&A | iEEG clinical dataset (Background) | ROUGE-L 30.4 | 14 |
| Transcription | iEEG clinical dataset (Foreground) | WER 139.8 | 13 |
| Transcription | iEEG clinical dataset (Background) | WER 172.6 | 13 |
| Speaker Description | LibriTTS + DEMAND mixtures (Foreground) | Gender Acc 73.7 | 10 |
| Speaker Description | LibriTTS + DEMAND mixtures (Background) | Gender Acc 73.1 | 10 |
*Showing 10 of 22 rows.*