Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
About
We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented, token-based, decoder-only multi-modal language model capable of generating and infilling both text and images. CM3Leon uses the CM3 multi-modal architecture but additionally shows the extreme benefits of scaling up and tuning on more diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second multi-task supervised fine-tuning (SFT) stage. It is also a general-purpose model that can do both text-to-image and image-to-text generation, allowing us to introduce self-contained contrastive decoding methods that produce high-quality outputs. Extensive experiments demonstrate that this recipe is highly effective for multi-modal models. CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods (zero-shot MS-COCO FID of 4.88). After SFT, CM3Leon can also demonstrate unprecedented levels of controllability in tasks ranging from language-guided image editing to image-controlled generation and segmentation.
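The "self-contained contrastive decoding" mentioned above steers generation by contrasting a conditional distribution against a weaker reference distribution. The snippet below is a minimal illustrative sketch of classifier-free-guidance-style logit mixing, one common form of this idea; the function name, signature, and `alpha` value are assumptions for illustration, not CM3Leon's actual API.

```python
import numpy as np

def cfg_logits(cond_logits: np.ndarray,
               uncond_logits: np.ndarray,
               alpha: float = 3.0) -> np.ndarray:
    """Classifier-free-guidance-style mixing (illustrative sketch).

    Pushes the conditional logits away from the unconditional ones,
    amplifying the effect of the conditioning (e.g. the text prompt).
    alpha > 1 strengthens guidance; alpha = 1 recovers the
    conditional logits unchanged.
    """
    return uncond_logits + alpha * (cond_logits - uncond_logits)

# Toy example: two-token vocabulary where conditioning favors token 1.
cond = np.array([1.0, 2.0])
uncond = np.array([0.0, 0.0])
mixed = cfg_logits(cond, uncond, alpha=2.0)  # -> array([2., 4.])
```

In practice, the next token is sampled from a softmax over `mixed` rather than over `cond` alone; the contrast sharpens the gap between prompt-consistent and prompt-inconsistent tokens.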
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy: 47.6 | 1165 |
| Visual Question Answering | VizWiz | Accuracy: 37.6 | 1043 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy: 47.6 | 337 |
| Visual Question Answering | OKVQA | Top-1 Accuracy: 23.8 | 283 |
| Visual Question Answering | OK-VQA | Accuracy: 23.8 | 224 |
| Visual Question Answering | VQAv2 | Accuracy: 47.6 | 177 |
| Image Captioning | MS-COCO (test) | CIDEr: 61.6 | 117 |
| Image Captioning | COCO | CIDEr: 61.6 | 116 |
| Text-to-Image Generation | MS-COCO (val) | FID: 10.82 | 112 |
| Text-to-Image Generation | MS-COCO | FID: 10.3 | 75 |