
State-Space Large Audio Language Models

About

Large Audio Language Models (LALMs) combine audio perception models with Large Language Models (LLMs) and show a remarkable ability to reason about input audio, infer its meaning, and understand intent. However, these systems rely on Transformers, whose cost scales quadratically with input sequence length, which poses computational challenges when deploying them in memory- and time-constrained scenarios. Recently, state-space models (SSMs) have emerged as an alternative to Transformer networks. While there have been successful attempts to replace Transformer-based audio perception models with state-space ones, state-space-based LALMs remain unexplored. We first replace the Transformer-based audio perception module and then replace the Transformer-based LLM, proposing the first state-space-based LALM. Experimental results demonstrate that the state-space-based LALM, despite having significantly fewer parameters, performs competitively with Transformer-based LALMs on close-ended tasks across a variety of datasets.
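To make the scaling contrast in the abstract concrete, here is a minimal sketch of a scalar linear state-space recurrence next to a count of the pairwise scores that full self-attention computes. It is a toy illustration under stated assumptions, not the architecture used in the paper: the function names and the parameters `a`, `b`, `c` are hypothetical, and real SSM layers use learned, multi-dimensional state matrices.

```python
# Illustrative sketch only: a toy scalar state-space recurrence.
# This is NOT the paper's model; names and parameters are assumptions.

def ssm_scan(inputs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t over a sequence.

    A single pass over the input: time grows linearly with sequence
    length and the carried state has constant size.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x      # update the hidden state
        outputs.append(c * h)  # read out from the state
    return outputs


def attention_pair_count(seq_len):
    """Query-key scores computed by full (unmasked) self-attention."""
    return seq_len * seq_len


def ssm_step_count(seq_len):
    """Recurrence updates performed by ssm_scan for the same length."""
    return seq_len


if __name__ == "__main__":
    # Impulse response: the output decays geometrically with factor a.
    print(ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0))
    for t in (1_000, 10_000):
        print(t, attention_pair_count(t), ssm_step_count(t))
```

Growing the sequence tenfold multiplies the attention score count by one hundred but the number of recurrence steps by only ten, which is the gap the abstract refers to for long audio inputs.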

Saurabhchand Bhati, Yuan Gong, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass • 2024

Related benchmarks

Task                                Dataset              Result             Rank
Audio Classification                ESC-50               Accuracy 86.8      366
Environmental Sound Classification  FSD50K               mAP 47.7           91
Audio Classification                VGG-Sound            Top-1 Accuracy 51  83
Audio Captioning                    Clotho               --                 60
Audio Classification                AudioSet             mAP 19.4           46
Acoustic Scene Classification       TUT Acoustic Scenes  Accuracy 35.9      35
Audio Classification                Beijing Opera        Base Accuracy 61   34
Acoustic Scene Classification       DCASE                Mi-F1 Score 47.9   21
Audio Classification                VocalSound           Accuracy 61.2      21
