MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
About
Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, the majority of these models are constrained to processing low-resolution images, which limits their effectiveness in perception tasks that require detailed visual information. In our study, we present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capabilities by incorporating a multi-granularity vision flow comprising low-resolution, high-resolution, and object-centric features. We propose integrating an additional high-resolution visual encoder to capture fine-grained details, which are then fused with the base visual features through a Conv-Gate fusion network. To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates exceptional perception skills. We instantiate MG-LLaVA with a wide variety of language encoders, ranging from 3.8B to 34B parameters, to evaluate the model's performance comprehensively. Extensive evaluations across multiple benchmarks demonstrate that MG-LLaVA outperforms existing MLLMs of comparable parameter size, showcasing its remarkable efficacy. The code will be available at https://github.com/PhoenixZ810/MG-LLaVA.
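The Conv-Gate fusion described above can be illustrated with a minimal, dependency-free sketch. This is not the paper's implementation: the actual network operates on 2-D feature maps with convolutions, whereas here each spatial position is reduced to a scalar and the "conv" is a 1x1-style linear mix with hypothetical weights (`w_low`, `w_high`, `bias`). The sketch only shows the core idea: a sigmoid gate decides, per position, how much high-resolution detail is injected into the base feature.

```python
import math

def sigmoid(x):
    """Logistic function; squashes the gate logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def conv_gate_fuse(low_feat, high_feat, w_low=0.5, w_high=0.5, bias=0.0):
    """Toy per-position gated fusion of low- and high-resolution features.

    low_feat, high_feat: equal-length lists of per-position feature values
    (assumes the high-res map was already pooled to the low-res grid).
    w_low, w_high, bias: hypothetical gate parameters standing in for the
    learned 1x1 convolution in the paper's Conv-Gate network.
    """
    fused = []
    for lo, hi in zip(low_feat, high_feat):
        gate = sigmoid(w_low * lo + w_high * hi + bias)  # gate in (0, 1)
        fused.append(lo + gate * hi)  # gated residual injection of detail
    return fused

# Example: two positions, one with positive and one with negative detail.
fused = conv_gate_fuse([0.2, 0.5], [1.0, -0.3])
```

Because the gate is strictly between 0 and 1, each fused value always lies between the base feature alone and the full sum of base and detail, so the high-resolution branch can never fully overwrite the base representation.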
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 70 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 60 | 1043 |
| Visual Question Answering | GQA | Accuracy | 62.7 | 963 |
| Multimodal Evaluation | MME | -- | -- | 557 |
| Video Question Answering | MSRVTT-QA | Accuracy | 59.8 | 481 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 41 | 418 |
| Video Question Answering | MSVD-QA | Accuracy | 71.5 | 340 |
| Visual Question Answering | TextVQA (val) | VQA Score | 67.3 | 309 |
| Multimodal Understanding | MMMU | Accuracy | 35.3 | 275 |
| Visual Question Answering | ChartQA | Accuracy | 40.8 | 239 |