Chameleon: Mixed-Modal Early-Fusion Foundation Models
About
We present Chameleon, a family of early-fusion, token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon demonstrates broad and general capabilities in a single model: it achieves state-of-the-art performance on image captioning tasks, outperforms Llama-2 on text-only tasks while remaining competitive with models such as Mixtral 8x7B and Gemini Pro, and performs non-trivial image generation. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or the outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward toward unified modeling of full multimodal documents.
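To make the early-fusion, token-based idea concrete, the sketch below shows how interleaved text and image segments can be flattened into a single token stream for one autoregressive transformer. It is a minimal illustration only: the `text_tokenizer` and `image_quantizer` stand-ins, the vocabulary sizes, the ID offsets, and the begin/end-of-image markers are all assumptions for exposition, not Chameleon's actual tokenizers or vocabulary layout.

```python
# Minimal sketch of early-fusion, token-based mixed-modal sequencing.
# Assumptions (not from Chameleon's released code): hypothetical
# text_tokenizer and image_quantizer; vocabulary sizes, ID offsets, and
# special tokens below are illustrative only.

TEXT_VOCAB_SIZE = 32_000          # assumed text vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192       # assumed VQ image codebook size
BOI, EOI = 100_000, 100_001       # hypothetical begin/end-of-image markers


def text_tokenizer(text: str) -> list[int]:
    """Stand-in for a real BPE tokenizer: hash words into the text-ID space."""
    return [hash(w) % TEXT_VOCAB_SIZE for w in text.split()]


def image_quantizer(image_id: str, n_tokens: int = 16) -> list[int]:
    """Stand-in for a VQ image tokenizer: map an image to discrete codes."""
    return [(hash(image_id) + i) % IMAGE_CODEBOOK_SIZE for i in range(n_tokens)]


def fuse(segments: list[tuple[str, str]]) -> list[int]:
    """Flatten interleaved text/image segments into one token sequence.

    Image codes are shifted past the text vocabulary so both modalities
    share a single ID space, letting one autoregressive transformer model
    (and generate) text and image tokens jointly.
    """
    sequence: list[int] = []
    for modality, payload in segments:
        if modality == "text":
            sequence.extend(text_tokenizer(payload))
        else:  # "image"
            codes = [c + TEXT_VOCAB_SIZE for c in image_quantizer(payload)]
            sequence.extend([BOI, *codes, EOI])
    return sequence


if __name__ == "__main__":
    doc = [("text", "A photo of a chameleon:"),
           ("image", "img_0042"),
           ("text", "Describe its camouflage.")]
    print(fuse(doc)[:12])  # one mixed-modal token stream
```

The design point illustrated here is that, because all modalities live in one discrete token space, no modality-specific encoders or decoders are needed at generation time; the model simply continues the token stream.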
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy: 66 | 1165 |
| Visual Question Answering | TextVQA | -- | 1117 |
| Visual Question Answering | GQA | Accuracy: 66 | 963 |
| Object Hallucination Evaluation | POPE | -- | 935 |
| Image Captioning | MS COCO Karpathy (test) | CIDEr: 0.1372 | 682 |
| Text-based Visual Question Answering | TextVQA | Accuracy: 4.8 | 496 |
| Text-to-Image Generation | GenEval | Overall Score: 39 | 467 |
| Multimodal Understanding | MM-Vet | MM-Vet Score: 8.3 | 418 |
| Multimodal Understanding | MMBench | -- | 367 |
| Mathematical Reasoning | MathVista | Score: 22.3 | 322 |