
Joint Audio and Speech Understanding

About

Humans are surrounded by audio signals that include both speech and non-speech sounds. Recognizing and understanding speech and non-speech audio events, together with a deep comprehension of the relationship between them, are fundamental cognitive capabilities. For the first time, we build a machine learning model, called LTU-AS, that has a conceptually similar universal audio perception and advanced reasoning ability. Specifically, by integrating Whisper as a perception module and LLaMA as a reasoning module, LTU-AS can simultaneously recognize and jointly understand spoken text, speech paralinguistics, and non-speech audio events, i.e., almost everything perceivable from audio signals.
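The paragraph above describes a two-module design: a perception module (Whisper) produces acoustic features, which a connector maps into the embedding space of a reasoning module (LLaMA) alongside a text prompt. The sketch below illustrates only that data flow with toy NumPy arrays; all dimensions, function names, and the random "perception" stand-in are assumptions for illustration, not the actual LTU-AS implementation.

```python
import numpy as np

# Hypothetical sizes for illustration only; the real model uses Whisper's
# encoder output size and LLaMA's embedding size.
AUDIO_DIM = 512    # assumed acoustic feature size
LLM_DIM = 4096     # assumed LLM embedding size

rng = np.random.default_rng(0)

def perceive(audio_frames: np.ndarray) -> np.ndarray:
    """Stand-in for the Whisper perception module: map raw frames to
    acoustic features. Here it is just a fixed random projection."""
    w = rng.standard_normal((audio_frames.shape[-1], AUDIO_DIM)) * 0.01
    return audio_frames @ w

def project_to_llm(features: np.ndarray, w_proj: np.ndarray) -> np.ndarray:
    """Connector layer: project acoustic features into the LLM's
    token-embedding space so they can be prepended to the text prompt."""
    return features @ w_proj

# Toy inputs: 100 audio frames of 80 mel bins, and a 12-token text prompt
# already embedded in the LLM space.
audio = rng.standard_normal((100, 80))
prompt_embeddings = rng.standard_normal((12, LLM_DIM))

w_proj = rng.standard_normal((AUDIO_DIM, LLM_DIM)) * 0.01
audio_tokens = project_to_llm(perceive(audio), w_proj)

# The reasoning module (LLaMA in LTU-AS) would consume this joint sequence
# of audio tokens followed by prompt tokens.
llm_input = np.concatenate([audio_tokens, prompt_embeddings], axis=0)
print(llm_input.shape)  # (112, 4096)
```

The point of the connector is that the LLM never sees raw audio: it receives a sequence that looks like ordinary token embeddings, so speech, paralinguistics, and non-speech events can all be reasoned about in one forward pass.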

Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, James Glass · 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Summarization | iEEG clinical dataset (Foreground) | ROUGE-L | 40.2 | 14 |
| Description | iEEG clinical dataset (Background) | Avg Score (G, P, T) | 52.2 | 14 |
| Summarization | iEEG clinical dataset (Background) | ROUGE-L | 27.9 | 14 |
| Description | iEEG clinical dataset (Foreground) | Avg Score (G, P, T) | 48.9 | 14 |
| Free Q&A | iEEG clinical dataset (Foreground) | ROUGE-L | 39 | 14 |
| Free Q&A | iEEG clinical dataset (Background) | ROUGE-L | 30.4 | 14 |
| Transcription | iEEG clinical dataset (Foreground) | WER | 139.8 | 13 |
| Transcription | iEEG clinical dataset (Background) | WER | 172.6 | 13 |
| Speaker Description | LibriTTS + DEMAND mixtures (Foreground) | Gender Acc | 73.7 | 10 |
| Speaker Description | LibriTTS + DEMAND mixtures (Background) | Gender Acc | 73.1 | 10 |

Showing 10 of 22 rows.
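The table reports two text-overlap metrics: word error rate (WER) for transcription and ROUGE-L for summarization and Q&A. Note that WER can exceed 100% (as in the 139.8 and 172.6 rows) when insertion errors outnumber the words in the reference. The minimal implementations below use the standard dynamic-programming definitions of both metrics; the example sentences are made up for illustration and are not taken from the benchmark.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance over token lists: minimum number of
    substitutions, insertions, and deletions to turn hyp into ref."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)

def lcs_len(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference: str, hypothesis: str) -> float:
    """ROUGE-L F1 in percent, based on LCS precision and recall."""
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_len(ref, hyp)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(hyp), lcs / len(ref)
    return 100.0 * 2 * p * r / (p + r)

# WER exceeds 100% when insertions dominate: 3 errors over a 2-word reference.
print(round(wer("good morning", "uh good uh morning everyone"), 1))  # 150.0
# ROUGE-L on a made-up summary pair.
print(round(rouge_l_f1("the patient reports mild pain", "patient reports pain"), 1))  # 75.0
```

Because WER divides the error count by the reference length, a model that hallucinates extra words on noisy audio can score well above 100%, which is why the Background-condition transcription rows are worse than the Foreground ones.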
