
Listen, Think, and Understand

About

The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is crucial for many applications. Although significant progress has been made in this area since the development of AudioSet, most existing models are designed to map audio inputs to pre-defined, discrete sound label sets. In contrast, humans possess the ability to not only classify sounds into general categories, but also to listen to the finer details of the sounds, explain the reason for the predictions, think about what the sound infers, and understand the scene and what action needs to be taken, if any. Such capabilities beyond perception are not yet present in existing audio models. On the other hand, modern large language models (LLMs) exhibit emerging reasoning ability but they lack audio perception capabilities. Therefore, we ask the question: can we build a model that has both audio perception and a reasoning ability? In this paper, we propose a new audio foundation model, called LTU (Listen, Think, and Understand). To train LTU, we created a new OpenAQA-5M dataset consisting of 1.9 million closed-ended and 3.7 million open-ended, diverse (audio, question, answer) tuples, and have used an autoregressive training framework with a perception-to-understanding curriculum. LTU demonstrates strong performance and generalization ability on conventional audio tasks such as classification and captioning. More importantly, it exhibits emerging audio reasoning and comprehension abilities that are absent in existing audio models. To the best of our knowledge, LTU is one of the first multimodal large language models that focus on general audio (rather than just speech) understanding.

Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, James Glass • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Classification | AudioSet (test) | mAP 18.7 | 57 |
| Audio Understanding | MMAU v05.15.25 (test-mini) | Sound Score 20.42 | 28 |
| Audio Understanding | MMAU (test) | Speech Score 15.33 | 25 |
| Classification | FSD50K (test) | mAP 46.31 | 24 |
| Classification | US8K (test) | Accuracy 50.3 | 22 |
| Audio Question and Answering | ClothoAQA | Accuracy 11.9 | 20 |
| Classification | ESC-50 (test) | Accuracy 83.1 | 16 |
| General Audio Understanding | MMSU 1.0 (test) | Perception Semantics 21.34 | 16 |
| Audio Understanding | MMAR (comprehensive evaluation) | Sound Score 19.39 | 15 |
| Environmental Sound Classification | ESC-50 (test) | Top-1 Fidelity 83.1 | 14 |
Showing 10 of 22 rows
