Matryoshka Query Transformer for Large Vision-Language Models

About

Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m ≤ M latent query tokens and train the model using only these first m tokens, discarding the rest. Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLaVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA's fixed 576. Reducing to 16 tokens (8× fewer TFLOPs) sacrifices only 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens, with performance drops of just 3% and 6%, respectively. Our exploration of the trade-off between accuracy and the computational cost incurred by the number of visual tokens facilitates future research toward achieving the best of both worlds.
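The core training trick described above — sampling a random m ≤ M each step and keeping only the first m latent queries — can be illustrated with a toy sketch. The code below is an assumption-laden simplification, not the authors' implementation: the cross-attention between latent queries and visual features is reduced to a single softmax-attention step, dimensions are shrunk for readability, and all names (`compress_visual_features`, `latent_queries`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

M = 256  # maximum number of latent query tokens (MQT-LLaVA's maximum per the abstract)
d = 64   # embedding dimension (illustrative only; real models use far larger dims)

# Learnable latent queries of the query transformer (here just random placeholders).
latent_queries = rng.normal(size=(M, d))

def compress_visual_features(visual_feats, m, queries=latent_queries):
    """Toy query-transformer step: the first m latent queries attend over the
    visual patch features and return m compressed visual tokens."""
    q = queries[:m]                               # Matryoshka truncation: keep first m
    logits = q @ visual_feats.T                   # (m, num_patches) attention logits
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over patches
    return attn @ visual_feats                    # (m, d) visual tokens

visual_feats = rng.normal(size=(576, d))          # e.g. 24x24 grid of patch embeddings

# Training: each step draws a random m <= M and trains on only the first m tokens.
for _ in range(3):
    m = int(rng.integers(1, M + 1))
    tokens = compress_visual_features(visual_feats, m)
    assert tokens.shape == (m, d)

# Inference: pick any m to trade accuracy for compute, e.g. 16 tokens.
print(compress_visual_features(visual_feats, 16).shape)
```

Because every prefix of the query sequence is trained to be useful on its own, a single checkpoint can serve any token budget at inference time, which is the flexibility the abstract claims.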

Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, Kai-Wei Chang • 2024

Related benchmarks

Task                                  | Dataset           | Result                   | Rank
--------------------------------------|-------------------|--------------------------|-----
Visual Question Answering             | VizWiz            | Accuracy 54.1            | 1525
Object Hallucination Evaluation       | POPE              | Accuracy 86.2            | 1455
Visual Question Answering             | VQA v2            | Accuracy 77.9            | 1362
Visual Question Answering             | TextVQA           | Accuracy 60.2            | 1285
Visual Question Answering             | GQA               | Accuracy 61.6            | 1249
Text-based Visual Question Answering  | TextVQA           | Accuracy 53.4            | 807
Visual Question Answering             | VQA v2 (test-dev) | Overall Accuracy 75.3    | 706
Multimodal Evaluation                 | MME               | Score 1470               | 658
Multimodal Understanding              | MMMU              | Accuracy 34.8            | 437
Multimodal Reasoning                  | MM-Vet            | MM-Vet Score 29.8        | 431

Showing 10 of 33 rows.
