When LLaVA Meets Objects: Token Composition for Vision-Language Models
About
Current autoregressive Vision-Language Models (VLMs) typically rely on a large number of visual tokens to represent images, which drives up compute, especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that leverages visual features at different levels to create a compact yet information-rich visual representation for autoregressive VLMs. Specifically, we combine mask-based object tokens with global tokens and local patch tokens. While all tokens are used during training, we show that the resulting model can flexibly drop tokens at test time, in particular the mask-based object tokens, so that the number of visual tokens can be adapted during inference without retraining and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks, showing results competitive with current token-efficient methods and comparable to the original LLaVA baseline while using only a fraction of the visual tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time with good performance.
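As a rough illustration of the token composition described above, the sketch below shows one way to assemble a visual sequence from a global token, local patch tokens, and mask-pooled object tokens, with an optional inference-time budget on the object tokens. The function name, tensor shapes, and the size-based drop criterion are illustrative assumptions, not the actual Mask-LLaVA implementation.

```python
import torch

def compose_visual_tokens(global_tok, patch_toks, patch_feats, masks,
                          num_object_tokens=None):
    """Hypothetical sketch of multi-level visual token composition.

    global_tok:  (B, 1, D)    image-level feature (e.g. CLS token)
    patch_toks:  (B, P, D)    local patch tokens
    patch_feats: (B, H, W, D) dense feature map used for mask pooling
    masks:       (B, M, H, W) binary object masks
    num_object_tokens: optional cap applied only at inference time;
        object tokens beyond this budget are dropped without retraining.
    """
    B, M, H, W = masks.shape

    # Mask-average-pool the dense features to obtain one token per object.
    area = masks.flatten(2).sum(-1).clamp(min=1.0)                # (B, M)
    obj_toks = torch.einsum("bmhw,bhwd->bmd", masks.float(),
                            patch_feats) / area.unsqueeze(-1)     # (B, M, D)

    if num_object_tokens is not None and num_object_tokens < M:
        # Keep the largest objects first (one possible drop criterion,
        # assumed here for illustration).
        keep = area.topk(num_object_tokens, dim=1).indices        # (B, k)
        obj_toks = torch.gather(
            obj_toks, 1,
            keep.unsqueeze(-1).expand(-1, -1, obj_toks.size(-1)))

    # Final visual sequence fed to the LLM: [global | patches | objects].
    return torch.cat([global_tok, patch_toks, obj_toks], dim=1)
```

At training time all object tokens would be kept; at test time `num_object_tokens` can be lowered to trade accuracy for compute without retraining.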
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VizWiz | Accuracy: 51.8 | 1043 |
| Object Hallucination Evaluation | POPE | Accuracy: 85.8 | 935 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy: 74.8 | 664 |
| Multimodal Evaluation | MME | Score: 1.44e+3 | 557 |
| Visual Question Answering | GQA | Accuracy: 60.2 | 374 |
| Multimodal Capability Evaluation | MM-Vet | Score: 31.1 | 282 |
| Science Question Answering | ScienceQA IMG | Accuracy: 68.8 | 256 |
| Science Question Answering | ScienceQA | Accuracy: 70.8 | 229 |
| Multimodal Model Evaluation | MMBench | Accuracy: 64.9 | 180 |
| Multimodal Evaluation | MM-Vet | -- | 122 |