SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
About
We propose SPHINX-X, an extensive Multimodal Large Language Model (MLLM) series developed upon SPHINX. To improve architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying the multi-stage training pipeline into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain, multimodal dataset covering publicly available resources for language, vision, and vision-language tasks. We further enrich this collection with our curated OCR-intensive and Set-of-Mark datasets, extending its diversity and generality. By training over different base LLMs, including TinyLlama-1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral-8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capability. Comprehensive benchmarking reveals a strong correlation between multi-modal performance and the scales of data and parameters. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
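The "skip token" idea above can be sketched as follows. This is a minimal illustration, not the released implementation: when an image is partitioned into sub-images for the visual encoder, a sub-image that is entirely padding carries no information, so its full token sequence can be replaced by a single placeholder token. All names, shapes, and the all-zeros padding check are assumptions for illustration.

```python
import numpy as np

def pack_with_skip_token(sub_images, sub_tokens, skip_token):
    """Replace the token sequence of fully-padded sub-images with one skip token.

    sub_images: list of (C, H, W) arrays, one per sub-image (hypothetical layout)
    sub_tokens: list of (N, dim) arrays, the encoder tokens for each sub-image
    skip_token: (dim,) vector standing in for a learnable skip-token embedding
    """
    packed = []
    for img, toks in zip(sub_images, sub_tokens):
        if not img.any():                        # fully-padded sub-image (all zeros)
            packed.append(skip_token[None, :])   # one token instead of N
        else:
            packed.append(toks)
    return np.concatenate(packed, axis=0)

dim = 8
skip = np.zeros(dim)                             # placeholder for a learned parameter
subs = [np.random.rand(3, 4, 4), np.zeros((3, 4, 4))]   # second sub-image is all padding
toks = [np.random.rand(16, dim), np.random.rand(16, dim)]
out = pack_with_skip_token(subs, toks, skip)
print(out.shape)  # (17, 8): 16 real tokens + 1 skip token
```

In this toy setting the padded sub-image shrinks from 16 tokens to 1, which is the source of the training-efficiency gain the abstract describes.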
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 81.1 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 68.0 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 49.6 | 1043 |
| Visual Question Answering | GQA | Accuracy | 58.0 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 89.6 | 935 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 75.5 | 664 |
| Multimodal Evaluation | MME | Score | 1850 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 57.8 | 496 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 36.5 | 418 |
| Visual Question Answering | GQA | Accuracy | 58.0 | 374 |