
Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models

About

Vision-Language Models (VLMs) incur substantial computational overhead and inference latency due to the large number of vision tokens introduced by high-resolution image and video inputs. Existing parameter-free token compression methods typically rely on token selection or merging, yet they risk discarding substantial visual information or distorting the original representation distribution, resulting in pronounced performance degradation at high compression ratios. In response, we aim to explore a more effective and efficient visual token compression strategy, with a promising direction in the frequency domain. Motivated by the success of frequency-domain transforms in image compression (e.g., JPEG), we systematically analyze the frequency redundancy in visual representations and uncover a non-uniform distribution of semantic information across frequency bands. Building upon this, we introduce Fourier Compressor, an effective, parameter-free, and highly generalizable module that removes redundancy from visual representations within the frequency domain. Implemented via FFT with $\mathcal{O}(n^2 \log n)$ complexity and no additional parameters, Fourier Compressor introduces negligible computational overhead while preserving semantic fidelity. Extensive experiments on image-based benchmarks demonstrate that our method achieves a favorable performance-efficiency trade-off, retaining over 96% of the original accuracy while reducing inference FLOPs by up to 83.8% and boosting generation speed by 31.2%. It consistently outperforms existing parameter-free methods and even surpasses some parameterized approaches. Importantly, Fourier Compressor generalizes consistently across both LLaVA and Qwen-VL architectures, and further extends to video understanding tasks, highlighting its practical applicability for efficient VLMs.
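To make the frequency-domain idea concrete, the following is a minimal sketch of low-pass token compression. It is not the authors' implementation. It assumes the vision tokens lie on a row-major `h × w` grid, that compression keeps only a centered low-frequency block of the 2D spectrum, and that the result is mapped back to a smaller token grid by an inverse FFT. The function name `fourier_compress` and its parameters are hypothetical.

```python
import numpy as np


def fourier_compress(tokens, grid_hw, keep_hw):
    """Low-pass compress a grid of vision tokens in the frequency domain.

    Hypothetical sketch, not the paper's code.
    tokens:  (h*w, d) array of token features in row-major grid order.
    grid_hw: (h, w) shape of the original token grid.
    keep_hw: (kh, kw) shape of the compressed grid, with kh <= h and kw <= w.
    Returns a (kh*kw, d) array of compressed tokens.
    """
    h, w = grid_hw
    kh, kw = keep_hw
    d = tokens.shape[1]
    x = tokens.reshape(h, w, d)

    # 2D FFT over the two spatial axes, applied independently per channel.
    spec = np.fft.fft2(x, axes=(0, 1))
    # Move the zero-frequency component to the center of the spectrum.
    spec = np.fft.fftshift(spec, axes=(0, 1))

    # Crop the centered kh x kw block of low frequencies.
    top = h // 2 - kh // 2
    left = w // 2 - kw // 2
    low = spec[top:top + kh, left:left + kw]

    # Undo the shift and invert on the smaller grid.
    low = np.fft.ifftshift(low, axes=(0, 1))
    out = np.fft.ifft2(low, axes=(0, 1)).real
    # Rescale so the mean feature value is preserved after resizing.
    out *= (kh * kw) / (h * w)
    return out.reshape(kh * kw, d)


# Example: compress a hypothetical 24 x 24 grid (576 tokens) to 12 x 12 (144 tokens).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((24 * 24, 64))
compressed = fourier_compress(tokens, (24, 24), (12, 12))
print(compressed.shape)
```

The sketch has no learned parameters, and its cost is dominated by the 2D FFT, which for an `n × n` token grid is on the order of $n^2 \log n$ operations per feature channel.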

Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy 85.4 | 2019 |
| Text-based Visual Question Answering | TextVQA | Accuracy 57.4 | 962 |
| Science Question Answering | ScienceQA | Accuracy 71.1 | 791 |
| Massive Multi-discipline Multimodal Understanding | MMMU | Accuracy 35.8 | 216 |
| Visual Question Answering | GQA | Accuracy 62.7 | 155 |
| Multimodal Reasoning | MMBench | Accuracy 66.4 | 127 |
| Visual Instruction Following | LLaVA-Bench Wild | Score 69.5 | 71 |
| Video Understanding | MVBench zero-shot | Accuracy 62.9 | 25 |
| Multimodal Understanding | lmms-eval zero-shot (MME, VQA^T, POPE, RWQA, MMB, MMS) | Average Score 72.6 | 4 |
