Matryoshka Multimodal Models
About
Large Multimodal Models (LMMs) such as LLaVA have shown strong performance in visual-linguistic reasoning. These models first embed images into a large, fixed number of visual tokens and then feed them into a Large Language Model (LLM). However, this design produces an excessive number of tokens for dense visual scenarios such as high-resolution images and videos, leading to great inefficiency. While token pruning and merging methods exist, they produce a single-length output for each image and offer no flexibility in trading off information density vs. efficiency. Inspired by the concept of Matryoshka dolls, we propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities. Our approach offers several unique benefits for LMMs: (1) one can explicitly control the visual granularity per test instance during inference, e.g., adjusting the number of tokens used to represent an image based on the anticipated complexity or simplicity of the content; (2) M3 provides a framework for analyzing the granularity needed for existing datasets, where we find that COCO-style benchmarks need only around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens; (3) our approach provides a foundation for exploring the best trade-off between performance and visual token length at the sample level, where our investigation reveals that a large gap exists between the oracle upper bound and current fixed-scale representations.
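To make the idea of nested coarse-to-fine token sets concrete, the following is a minimal sketch, not the authors' implementation. It assumes the 576 visual tokens form a 24 x 24 grid and that coarser sets are obtained by average pooling; the pooling operator, the intermediate scales (144 and 36 tokens), and the function names `avg_pool_grid` and `nested_token_sets` are illustrative assumptions. Only the 576-token and roughly 9-token scales are stated in the abstract.

```python
def avg_pool_grid(grid, k):
    """Average-pool a square grid of feature vectors with a k x k window, stride k."""
    n = len(grid)
    assert n % k == 0, "window size must divide the grid side"
    dim = len(grid[0][0])
    pooled = []
    for i in range(0, n, k):
        row = []
        for j in range(0, n, k):
            acc = [0.0] * dim
            for di in range(k):
                for dj in range(k):
                    vec = grid[i + di][j + dj]
                    for d in range(dim):
                        acc[d] += vec[d]
            row.append([a / (k * k) for a in acc])
        pooled.append(row)
    return pooled


def nested_token_sets(grid, sides=(24, 12, 6, 3, 1)):
    """Return {token_count: flat list of tokens} for each assumed scale.

    `sides` lists the grid side length at each scale (hypothetical choice);
    each coarser set is a pooled summary of the same underlying grid.
    """
    n = len(grid)
    sets = {}
    for side in sides:
        pooled = avg_pool_grid(grid, n // side)
        tokens = [vec for row in pooled for vec in row]
        sets[len(tokens)] = tokens
    return sets


# Toy input: a 24 x 24 grid of 4-dimensional "visual tokens".
grid = [[[float(i * 24 + j)] * 4 for j in range(24)] for i in range(24)]
sets = nested_token_sets(grid)
print(sorted(sets))  # token counts available at inference time
```

Under these assumptions, choosing a granularity at inference time amounts to picking one key of `sets` (for example `sets[9]` for a simple scene or `sets[576]` for a dense one) and passing only those tokens to the language model.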
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 85.5 | 935 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 76.9 | 664 |
| Science Question Answering | ScienceQA IMG | Accuracy | 68.2 | 256 |
| Multimodal Evaluation | MM-Vet | -- | -- | 122 |
| Video Question Answering | NExT-QA Multi-choice | Accuracy | 63.1 | 102 |
| Visual Question Answering | VizWiz (test-dev) | Accuracy | 52.8 | 65 |
| Multi-modal Evaluation | MME (total) | MME Total Score | 1.42e+3 | 61 |
| Multimodal Benchmarking | MMBench English | Accuracy | 64.8 | 61 |
| Multiple-choice Video Question Answering | EgoSchema | Accuracy | 36.8 | 61 |
| Multiple Choice VideoQA | IntentQA | Accuracy | 58.8 | 41 |