
Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models

About

Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets because of their large number of visual tokens. Previous methods reduce the number of visual tokens before or within the LLM. However, these strategies inevitably lose visual semantics. To address this issue, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy that boosts the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of the reduced visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequency components are then modulated using lightweight learnable parameters. The high-frequency component from AvgPool acts as a saliency filter that enhances salient visual semantics, while the low-frequency component from MaxPool acts as an anti-saliency filter that strengthens weak visual semantics. This preserves visual semantics dominated by a few visual tokens and restores diluted visual semantics. Additionally, we inject FMVR into Matryoshka Representation Learning to learn coarse-to-fine visual token sets, so that the number of visual tokens can be adjusted elastically during inference while maintaining comparable performance. Experiments across 10 image-based and 4 video-based benchmarks demonstrate that FMVR-LLaVA reduces the FLOPs of LLaVA-1.5-7B by 89% while maintaining almost 100% of the original accuracy. The code will be made publicly available.
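
To make the pooling-based decomposition more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes that the high-frequency part is the residual left after average pooling, that the low-frequency part is the max-pooled signal, and that modulation is a residual update with two scalar weights. The names `pool_1d`, `fmvr_sketch`, `alpha`, `beta`, and `window` are hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]


def pool_1d(tokens: List[Vector], window: int, mode: str) -> List[Vector]:
    """Stride-1 pooling along the token axis; windows are clipped at the edges.

    The output has the same number of tokens as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(tokens)
    pooled = []
    for i in range(n):
        neighbours = tokens[max(0, i - half): min(n, i + half + 1)]
        channels = list(zip(*neighbours))  # one tuple per feature channel
        if mode == "avg":
            pooled.append([sum(c) / len(c) for c in channels])
        elif mode == "max":
            pooled.append([max(c) for c in channels])
        else:
            raise ValueError(f"unknown mode: {mode}")
    return pooled


def fmvr_sketch(tokens: List[Vector], alpha: float, beta: float,
                window: int = 3) -> List[Vector]:
    """Add modulated frequency components back onto the reduced tokens.

    ASSUMPTION: high = x - AvgPool(x), low = MaxPool(x), and the output is
    x + alpha * high + beta * low. In a trained model alpha and beta would
    be learnable; here they are plain floats.
    """
    avg = pool_1d(tokens, window, "avg")
    mx = pool_1d(tokens, window, "max")
    restored = []
    for x, a, m in zip(tokens, avg, mx):
        high = [xi - ai for xi, ai in zip(x, a)]  # residual after smoothing
        low = m                                   # max-pooled component
        restored.append([xi + alpha * hi + beta * li
                         for xi, hi, li in zip(x, high, low)])
    return restored
```

With `alpha = beta = 0` the sketch returns the reduced tokens unchanged, which is one way to read "plug-and-play": the module can start as an identity and learn how much of each component to add back.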

Qingtao Pan, Zhihao Dou, Shuo Li • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | VizWiz | Accuracy | 57.4 | 1525 |
| Object Hallucination Evaluation | POPE | Accuracy | 88.1 | 1455 |
| Visual Question Answering | VQA v2 | Accuracy | 80.7 | 1362 |
| Visual Question Answering | TextVQA | Accuracy | 61.3 | 1285 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 59.2 | 807 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 82.4 | 706 |
| Multimodal Evaluation | MME | Score | 1.53e+3 | 658 |
| Visual Question Answering | GQA | Accuracy | 62.5 | 505 |
| Visual Question Answering | GQA | Mean Accuracy | 64.2 | 196 |
| Scientific Question Answering | ScienceQA image | Accuracy | 70.6 | 184 |
Showing 10 of 20 rows
