AAD-LLM: Neural Attention-Driven Auditory Scene Understanding

About

Auditory foundation models, including auditory large language models (LLMs), process all sound inputs equally, independent of listener perception. However, human auditory perception is inherently selective: listeners focus on specific speakers while ignoring others in complex auditory scenes. Existing models do not incorporate this selectivity, limiting their ability to generate perception-aligned responses. To address this, we introduce Intention-Informed Auditory Scene Understanding (II-ASU) and present Auditory Attention-Driven LLM (AAD-LLM), a prototype system that integrates brain signals to infer listener attention. AAD-LLM extends an auditory LLM by incorporating intracranial electroencephalography (iEEG) recordings to decode which speaker a listener is attending to and refine responses accordingly. The model first predicts the attended speaker from neural activity, then conditions response generation on this inferred attentional state. We evaluate AAD-LLM on speaker description, speech transcription and extraction, and question answering in multitalker scenarios, with both objective and subjective ratings showing improved alignment with listener intention. By taking a first step toward intention-aware auditory AI, this work explores a new paradigm where listener perception informs machine listening, paving the way for future listener-centered auditory systems. Demo and code available: https://aad-llm.github.io.
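The two-stage pipeline described above can be illustrated with a toy sketch. Stage one decodes the attended speaker from neural activity, and stage two conditions the response on that choice. The sketch assumes a common auditory attention decoding setup that this page does not confirm for AAD-LLM. Some decoder (not shown) first reconstructs a speech envelope from iEEG. The attended speaker is then taken to be the candidate whose envelope correlates best with that reconstruction. The functions `pearson`, `decode_attended_speaker`, and `condition_request`, and the `speaker_roles` field, are hypothetical names for illustration. They are not AAD-LLM's actual interface.

```python
import numpy as np


def pearson(a, b) -> float:
    """Pearson correlation between two equal-length 1-D signals."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0


def decode_attended_speaker(reconstructed_env, speaker_envs):
    """Hypothetical stage 1: pick the speaker whose envelope best matches
    an envelope assumed to be already reconstructed from iEEG."""
    scores = [pearson(reconstructed_env, env) for env in speaker_envs]
    return int(np.argmax(scores)), scores


def condition_request(question, attended_idx, num_speakers):
    """Hypothetical stage 2: attach the inferred attentional state
    (attended = foreground, others = background) to the request."""
    roles = ["foreground" if i == attended_idx else "background"
             for i in range(num_speakers)]
    return {"question": question, "speaker_roles": roles}


if __name__ == "__main__":
    # Synthetic example: two speakers with orthogonal envelopes; the
    # "reconstruction" is speaker 0's envelope plus a small distortion.
    t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
    envs = [np.sin(t), np.cos(t)]
    recon = envs[0] + 0.01 * np.cos(3 * t)
    idx, scores = decode_attended_speaker(recon, envs)
    print(idx, [round(s, 3) for s in scores])
    print(condition_request("What did the speaker say?", idx, len(envs)))
```

Correlation-based selection is only one way to infer attention. AAD-LLM's own decoder, and the way it passes the attentional state to the LLM, may differ.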

Xilin Jiang, Sukru Samet Dindar, Vishal Choudhari, Stephan Bickel, Ashesh Mehta, Guy M McKhann, Daniel Friedman, Adeen Flinker, Nima Mesgarani• 2025

Related benchmarks

Task                  Dataset                                Metric                Result  Rank
Description           iEEG clinical dataset Background       Avg Score (G, P, T)   92.3    14
Summarization         iEEG clinical dataset Foreground       ROUGE-L               60.9    14
Description           iEEG clinical dataset Foreground       AVG(G, P, T)          89.9    14
Free Q&A              iEEG clinical dataset Foreground       ROUGE-L               63.2    14
Free Q&A              iEEG clinical dataset Background       ROUGE-L               59.3    14
Summarization         iEEG clinical dataset Background       ROUGE-L               44.9    14
Transcription         iEEG clinical dataset Foreground       WER                   6       13
Transcription         iEEG clinical dataset Background       WER                   22.5    13
Speaker Description   LibriTTS + DEMAND mixtures Background  Gender Accuracy       99.5    10
Summarization         LibriTTS + DEMAND mixtures Background  ROUGE-L               46.3    10

Showing 10 of 20 rows.
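The table reports ROUGE-L (higher is better) and WER (lower is better). The sketch below computes both from their standard definitions as a reference point. WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. ROUGE-L is taken as the F1 score of the longest common subsequence between reference and hypothesis. This page does not state the benchmark's text normalization, tokenization, or ROUGE-L weighting. The lowercase whitespace tokenization and the F1 (beta = 1) form used here are therefore assumptions, and the helper names are illustrative. The table's values appear to be percentages (score × 100).

```python
def _tokens(text: str) -> list[str]:
    # Assumed normalization: lowercase + whitespace split (no punctuation handling).
    return text.lower().split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over word tokens, using a single rolling DP row."""
    row = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        diag, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            above = row[j]
            row[j] = min(
                above + 1,         # deletion: reference word r is missing
                row[j - 1] + 1,    # insertion: extra hypothesis word h
                diag + (r != h),   # substitution (free when the words match)
            )
            diag = above
    return row[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate = edit distance / number of reference words."""
    ref, hyp = _tokens(reference), _tokens(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    row = [0] * (len(b) + 1)
    for x in a:
        diag = 0
        for j, y in enumerate(b, 1):
            above = row[j]
            row[j] = diag + 1 if x == y else max(above, row[j - 1])
            diag = above
    return row[-1]


def rouge_l_f1(reference: str, hypothesis: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall (assumed beta = 1)."""
    ref, hyp = _tokens(reference), _tokens(hypothesis)
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    print(f"WER = {100 * wer(ref, 'the cat sat on mat'):.1f}")          # WER = 16.7
    print(f"ROUGE-L = {100 * rouge_l_f1(ref, 'the cat on the mat'):.1f}")  # ROUGE-L = 90.9
```

Scores from this sketch will generally not match the leaderboard exactly. Normalization choices such as punctuation stripping can shift both metrics noticeably.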
