Joint Audio and Speech Understanding
About
Humans are surrounded by audio signals that include both speech and non-speech sounds. Recognizing and understanding speech and non-speech audio events, together with a deep comprehension of the relationship between them, are fundamental cognitive capabilities. We build, for the first time, a machine learning model, called LTU-AS, with conceptually similar universal audio perception and advanced reasoning abilities. Specifically, by integrating Whisper as a perception module and LLaMA as a reasoning module, LTU-AS can simultaneously recognize and jointly understand spoken text, speech paralinguistics, and non-speech audio events, that is, almost everything perceivable from audio signals.
Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, James Glass • 2023
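The perception-plus-reasoning design described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the encoder, the embedding widths (`AUDIO_DIM`, `LLM_DIM`), and the frame count are stand-ins, and the real system uses a trained Whisper encoder and LLaMA rather than the toy components here. The key idea it shows is that audio is encoded into continuous embeddings, projected into the LLM's token-embedding space, and prepended to the text-prompt embeddings so the reasoning module can attend to both.

```python
# Hypothetical sketch of a Whisper-style perception module feeding an
# LLM reasoning module; all shapes and components are assumptions.
import numpy as np

AUDIO_DIM = 768   # assumed audio-encoder output width
LLM_DIM = 4096    # assumed LLM hidden width

rng = np.random.default_rng(0)

def whisper_like_encoder(waveform: np.ndarray, frames: int = 25) -> np.ndarray:
    """Stand-in perception module: waveform -> (frames, AUDIO_DIM) embeddings.
    A real encoder would compute log-mel features and run a transformer;
    here we just pool the waveform into fixed-size frame embeddings."""
    chunks = np.array_split(waveform, frames)
    return np.stack([np.full(AUDIO_DIM, c.mean()) for c in chunks])

# Trainable linear projection from audio-embedding space into LLM token space.
W_proj = rng.standard_normal((AUDIO_DIM, LLM_DIM)) / np.sqrt(AUDIO_DIM)

def build_llm_input(waveform: np.ndarray, prompt_embeds: np.ndarray) -> np.ndarray:
    """Prepend projected audio embeddings to the text-prompt embeddings."""
    audio_embeds = whisper_like_encoder(waveform) @ W_proj   # (frames, LLM_DIM)
    return np.concatenate([audio_embeds, prompt_embeds], axis=0)

waveform = rng.standard_normal(16000)        # 1 s of fake 16 kHz audio
prompt = rng.standard_normal((8, LLM_DIM))   # 8 fake prompt-token embeddings
llm_input = build_llm_input(waveform, prompt)
print(llm_input.shape)  # (33, 4096): 25 audio frames + 8 prompt tokens
```

Because the audio enters as embeddings rather than as transcribed text, the reasoning module can use paralinguistic and non-speech information that a transcript would discard.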
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Summarization | iEEG clinical dataset (Foreground) | ROUGE-L 40.2 | 14 |
| Description | iEEG clinical dataset (Background) | Avg Score (G, P, T) 52.2 | 14 |
| Summarization | iEEG clinical dataset (Background) | ROUGE-L 27.9 | 14 |
| Description | iEEG clinical dataset (Foreground) | Avg Score (G, P, T) 48.9 | 14 |
| Free Q&A | iEEG clinical dataset (Foreground) | ROUGE-L 39 | 14 |
| Free Q&A | iEEG clinical dataset (Background) | ROUGE-L 30.4 | 14 |
| Transcription | iEEG clinical dataset (Foreground) | WER 139.8 | 13 |
| Transcription | iEEG clinical dataset (Background) | WER 172.6 | 13 |
| Speaker Description | LibriTTS + DEMAND mixtures (Foreground) | Gender Acc 73.7 | 10 |
| Speaker Description | LibriTTS + DEMAND mixtures (Background) | Gender Acc 73.1 | 10 |
*Showing 10 of 22 rows.*