EchoingPixels: Aliasing-Resistant Joint Token Reduction for Audio-Visual LLMs

About

Audio-Visual Large Language Models (AV-LLMs) incur prohibitive computational costs from processing massive, redundant audio-visual tokens. Existing unimodal compression techniques fail to capture the heterogeneous and mutually influential information density of joint audio-visual signals. Furthermore, we identify a fundamental and overlooked theoretical bottleneck in sparse token reduction: positional aliasing. We demonstrate that aggressive sparse sampling on standard position-encoded sequences violates the Nyquist limit relative to the effective token interval, causing phase-wrapping collisions that corrupt temporal monotonicity. To address this, we introduce EchoingPixels, a framework for aliasing-resistant joint token reduction. Our Cross-Modal Semantic Sieve performs extractive selection on the synergistic audio-visual stream, dynamically allocating budgets based on joint-modality saliency rather than fixed per-modality ratios. To resolve positional aliasing, we derive Sync-RoPE, a spectral low-pass filter for Rotary Positional Embeddings that adapts encoding bandwidth to the sparse sampling rate, preserving monotonic temporal relationships in the reduced stream. Experiments show that EchoingPixels achieves performance comparable to full models using only 5-20% of original tokens, validating theoretically grounded sparse learning as a robust solution for efficient AV-LLMs. Code is available at https://github.com/CharlesGong12/EchoingPixels.
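
The positional-aliasing claim can be illustrated with a small sketch. This is not the authors' Sync-RoPE code. It assumes the standard RoPE frequency schedule theta_i = base^(-2i/d) with base 10000, and it assumes that kept tokens are spaced uniformly, so that retaining 5% of tokens corresponds to an average stride of about 20. The function names (`rope_frequencies`, `aliased_channels`, `wrapped_phase`) are hypothetical. A channel is flagged as aliased when its phase advance between two neighbouring kept tokens exceeds pi, which is the Nyquist condition the abstract refers to.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE angular frequencies theta_i = base^(-2i/d), i = 0..d/2-1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def aliased_channels(head_dim, stride, base=10000.0):
    """Indices of channels whose phase step between kept tokens exceeds pi.

    `stride` is the assumed (uniform) gap, in original positions, between
    two consecutive tokens that survive the reduction.
    """
    return [i for i, theta in enumerate(rope_frequencies(head_dim, base))
            if stride * theta > math.pi]

def wrapped_phase(theta, position):
    """Rotation angle theta * position, wrapped into (-pi, pi]."""
    angle = theta * position
    return math.atan2(math.sin(angle), math.cos(angle))

if __name__ == "__main__":
    head_dim = 128  # illustrative head size, an assumption
    for stride in (1, 4, 20):
        bad = aliased_channels(head_dim, stride)
        print(f"stride={stride}: {len(bad)} of {head_dim // 2} channels alias")

    # On the fastest channel (theta = 1), wrapped phases at kept positions
    # 0, 4, 8 no longer increase monotonically with position.
    theta0 = rope_frequencies(head_dim)[0]
    print([round(wrapped_phase(theta0, p), 3) for p in (0, 4, 8)])
```

With dense sampling (stride 1) no channel crosses the limit, since the fastest frequency is 1 radian per token. As the stride grows, the highest-frequency channels cross it first, which is consistent with the abstract's motivation for adapting the encoding bandwidth to the sparse sampling rate.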

Chao Gong, Depeng Wang, Zhipeng Wei, Ya Guo, Huijia Zhu, Jingjing Chen • 2025

Related benchmarks

Task	Dataset	Result
Video Understanding	MLVU	Accuracy 68.3	80
Audio-visual understanding	WorldSense	Accuracy 47.4	72
Audio-visual understanding	Daily-Omni	Accuracy 60.65	60
Audio-Visual Perception	WorldSense	Score 47.4	26
Video Understanding	MLVU (dev)	MLVU Dev Score 68.3	24
Audio-visual understanding	Video-MME	Score 64.1	15
Video Understanding	Video-MME w/o audio	Accuracy 58.6	13
Audio-visual understanding	Video-MME w/ audio	Accuracy 64.1	10
Audio-Visual Perception	Daily-Omni	Score 60.65	8
Multimodal Understanding	Aggregate Audio-Visual & Video Benchmarks	Avg Audio-Visual Score 56.2	8
