GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
About
Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including features from a custom Audio Q-Former and from a multi-layer aggregator that combines features from multiple layers of an audio encoder. We fine-tune GAMA on a large-scale audio-language dataset, which gives it audio understanding capabilities. Next, we propose CompA-R (Instruction-Tuning for Complex Audio Reasoning), a synthetically generated instruction-tuning (IT) dataset with instructions that require the model to perform complex reasoning on the input audio. We instruction-tune GAMA with CompA-R to endow it with complex reasoning abilities, and we additionally feed the model a soft prompt that carries high-level semantic evidence derived from event tags of the input audio. Finally, we propose CompA-R-test, a human-labeled dataset for evaluating LALMs on open-ended audio question answering that requires complex reasoning. Through automated and expert human evaluations, we show that GAMA outperforms all other LALMs in the literature on diverse audio understanding tasks by margins of 1%-84%. Further, GAMA instruction-tuned on CompA-R proves superior in its complex reasoning and instruction-following capabilities.
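To make the idea of aggregating features from multiple encoder layers concrete, here is a minimal sketch in plain Python. It is not GAMA's aggregator: the abstract does not specify the mechanism, so this example assumes the simplest common variant, a softmax-weighted sum over per-layer features. The names `aggregate_layers` and `layer_scores` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Convert raw per-layer scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features, layer_scores):
    """Softmax-weighted sum of features taken from several encoder layers.

    layer_features: list of L layers, each a list of T frames,
                    each frame a list of D floats.
    layer_scores:   list of L raw scores (assumed learnable in a real model).
    Returns a single T x D feature matrix as nested lists.
    """
    if len(layer_features) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    num_frames = len(layer_features[0])
    dim = len(layer_features[0][0])
    aggregated = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_features):
        for t in range(num_frames):
            for d in range(dim):
                aggregated[t][d] += weight * layer[t][d]
    return aggregated


# Two toy "layers", one frame each, two feature dimensions.
shallow = [[1.0, 2.0]]
deep = [[3.0, 4.0]]
mixed = aggregate_layers([shallow, deep], [0.0, 0.0])  # equal scores: plain average
```

With equal scores the result is the plain average of the layers. A trained model would presumably learn unequal weights, or use a more expressive aggregator than a weighted sum.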
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio Captioning | AudioCaps (test) | CIDEr | 64.8 | 140 |
| Audio Question Answering | MMAR | Sd Score | 29.09 | 17 |
| Description | iEEG clinical dataset Background | Avg Score (G, P, T) | 48.2 | 14 |
| Description | iEEG clinical dataset Foreground | AVG(G, P, T) | 45.9 | 14 |
| Free Q&A | iEEG clinical dataset Background | ROUGE-L | 30.9 | 14 |
| Summarization | iEEG clinical dataset Background | ROUGE-L | 22.6 | 14 |
| Free Q&A | iEEG clinical dataset Foreground | ROUGE-L | 24 | 14 |
| Summarization | iEEG clinical dataset Foreground | ROUGE-L | 19 | 14 |
| Summarization | LibriTTS + DEMAND mixtures Foreground | ROUGE-L | 17.8 | 10 |
| Summarization | LibriTTS + DEMAND mixtures Background | ROUGE-L | 18.6 | 10 |