Locality Matters for Training-Free Audio Token Compression in Audio-Language Models

About

Audio-language models (ALMs) are increasingly used for audio captioning, question answering, and open-ended audio understanding, but their inference cost remains high when audio inputs are represented as long prefix-token sequences. These audio prefixes consume context budget, increase memory usage, and make deployment harder in resource-constrained or latency-sensitive settings. Existing training-free audio-token reduction methods mainly rely on fixed pooling or score-based pruning. Fixed pooling is content-agnostic, while score-based pruning can preserve isolated salient tokens but discard nearby acoustic context. We propose Local Temporal Bipartite Merging (LTBM), a training-free encoder-space compression method that merges similar nearby audio tokens under an explicit temporal window constraint. Beyond introducing LTBM, we use a controlled Global Merge variant to isolate whether temporal locality itself is a useful inductive bias for audio-token compression. Experiments on AudioCaps, Clotho, and MMAU with Qwen2-Audio show evidence of a task-dependent locality effect: locality-aware merging is more favorable for captioning at several compression settings, especially under stronger compression, while global matching is more competitive for multiple-choice audio understanding. A cross-backbone validation on Audio Flamingo 3 further supports the captioning-side advantage of locality-aware merging under moderate and aggressive compression.
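The abstract does not spell out the matching procedure, so the following is only a minimal sketch of the idea, assuming a ToMe-style bipartite matching (alternating tokens form sets A and B, and each A token is merged into its most similar B token by averaging). The function name bipartite_merge and the parameters r (number of tokens to merge) and window (maximum temporal distance between merged tokens) are hypothetical and not taken from the paper. Setting window=None removes the locality constraint, which roughly corresponds to the Global Merge control described above.

```python
import numpy as np

def bipartite_merge(tokens, r, window=None):
    """Illustrative sketch only, not the authors' implementation.

    tokens: (T, D) array of encoder-space audio tokens in temporal order.
    r:      maximum number of tokens to merge away (hypothetical parameter).
    window: maximum |i - j| allowed between merged positions; None = global.
    """
    T = tokens.shape[0]
    a_idx = np.arange(0, T, 2)  # set A: even positions (merge sources)
    b_idx = np.arange(1, T, 2)  # set B: odd positions (merge destinations)
    if r <= 0 or len(b_idx) == 0:
        return tokens.astype(float).copy()

    # Cosine similarity between every A token and every B token.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = unit[a_idx] @ unit[b_idx].T

    # Temporal locality: forbid matches farther apart than `window` frames.
    if window is not None:
        dist = np.abs(a_idx[:, None] - b_idx[None, :])
        sim = np.where(dist <= window, sim, -np.inf)

    # Each A token proposes its best B partner; keep the r strongest proposals.
    best_b = sim.argmax(axis=1)
    best_sim = sim[np.arange(len(a_idx)), best_b]
    order = np.argsort(-best_sim)
    merged_a = order[np.isfinite(best_sim[order])][:r]

    # Average each destination with all sources merged into it.
    sums = tokens[b_idx].astype(float).copy()
    counts = np.ones(len(b_idx))
    np.add.at(sums, best_b[merged_a], tokens[a_idx[merged_a]])
    np.add.at(counts, best_b[merged_a], 1)

    out = tokens.astype(float).copy()
    out[b_idx] = sums / counts[:, None]

    # Drop merged sources; the surviving tokens stay in temporal order.
    keep = np.ones(T, dtype=bool)
    keep[a_idx[merged_a]] = False
    return out[keep]
```

In this sketch, a token that closely resembles a distant frame but not its neighbors is merged with the distant frame under global matching, whereas the windowed variant leaves it in place and merges a locally redundant pair instead.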

Jiale Luo, Xiaoyu Liang, Haoji Hu • 2026

Related benchmarks

Task	Dataset	Result
Audio Captioning	Clotho	CIDEr 18.38	82
Audio Captioning	AudioCaps	CIDEr 49.51	66
Multiple-choice audio understanding	MMAU mini (test)	Average Accuracy 55.6	39
