Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models

About

Vision-Language Models (VLMs) incur substantial computational overhead and inference latency due to the large number of vision tokens introduced by high-resolution image and video inputs. Existing parameter-free token compression methods typically rely on token selection or merging, yet they risk discarding substantial visual information or distorting the original representation distribution, resulting in pronounced performance degradation at high compression ratios. In response, we aim to explore a more effective and efficient visual token compression strategy, with a promising direction in the frequency domain. Motivated by the success of frequency-domain transforms in image compression (e.g., JPEG), we systematically analyze the frequency redundancy in visual representations and uncover a non-uniform distribution of semantic information across frequency bands. Building upon this, we introduce Fourier Compressor, an effective, parameter-free, and highly generalizable module that removes redundancy from visual representations within the frequency domain. Implemented via FFT with $\mathcal{O}(n \log n)$ complexity and no additional parameters, Fourier Compressor introduces negligible computational overhead while preserving semantic fidelity. Extensive experiments on image-based benchmarks demonstrate that our method achieves a favorable performance-efficiency trade-off, retaining over 96% of the original accuracy while reducing inference FLOPs by up to 83.8% and boosting generation speed by 31.2%. It consistently outperforms existing parameter-free methods and even surpasses some parameterized approaches. Importantly, Fourier Compressor generalizes consistently across both LLaVA and Qwen-VL architectures, and further extends to video understanding tasks, highlighting its practical applicability for efficient VLMs.

Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin • 2025
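
To make the idea in the abstract concrete, here is a minimal sketch of parameter-free, FFT-based low-pass compression of a square grid of vision tokens. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the exact transform variant, which frequency bands are kept, or where in the model the module sits. The function name `fourier_compress`, its parameters `grid` and `keep`, and the 24×24 to 12×12 example sizes are all hypothetical.

```python
import numpy as np


def fourier_compress(tokens: np.ndarray, grid: int, keep: int) -> np.ndarray:
    """Low-pass compress a (grid*grid, dim) token matrix to (keep*keep, dim).

    Assumption: tokens lie on a square grid in row-major order, and only the
    lowest spatial frequencies are retained. This is a sketch, not the
    paper's exact procedure.
    """
    n, dim = tokens.shape
    if n != grid * grid:
        raise ValueError("tokens must have grid*grid rows")
    if not 0 < keep <= grid:
        raise ValueError("keep must be in (0, grid]")

    # Restore the 2D spatial layout: (grid, grid, dim).
    x = tokens.reshape(grid, grid, dim)

    # 2D FFT over the two spatial axes; shift the DC term to the center.
    spec = np.fft.fftshift(np.fft.fft2(x, axes=(0, 1)), axes=(0, 1))

    # Crop the central keep x keep block, i.e. the lowest frequencies.
    start = grid // 2 - keep // 2
    low = spec[start:start + keep, start:start + keep, :]

    # Inverse FFT on the smaller grid. For even `keep` the cropped spectrum
    # is not perfectly conjugate-symmetric, so we simply take the real part.
    out = np.fft.ifft2(np.fft.ifftshift(low, axes=(0, 1)), axes=(0, 1)).real

    # Rescale so that the mean activation of each channel is preserved.
    out *= (keep * keep) / (grid * grid)

    return out.reshape(keep * keep, dim)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vision_tokens = rng.standard_normal((24 * 24, 64))  # hypothetical sizes
    compressed = fourier_compress(vision_tokens, grid=24, keep=12)
    print(compressed.shape)  # (144, 64): one quarter of the original tokens
```

The sketch adds no learnable parameters, and its cost is dominated by the FFT calls. Cropping in the frequency domain and inverting on a smaller grid amounts to a band-limited downsampling of the token map, which is one plausible reading of "removing redundancy within the frequency domain".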

Related benchmarks

Task	Dataset	Metric	Result
Object Hallucination Evaluation	POPE	Accuracy	85.4	2056
Text-based Visual Question Answering	TextVQA	Accuracy	57.4	984
Science Question Answering	ScienceQA	Accuracy	71.1	916
Massive Multi-discipline Multimodal Understanding	MMMU	Accuracy	35.8	249
Visual Question Answering	GQA	Accuracy	62.7	218
Multimodal Reasoning	MMBench	Accuracy	66.4	180
Visual Instruction Following	LLaVA-Bench Wild	Score	69.5	71
Video Understanding	MVBench zero-shot	Accuracy	62.9	25
Multimodal Understanding	lmms-eval zero-shot (MME, VQA^T, POPE, RWQA, MMB, MMS)	Average Score	72.6	4

