OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
About
We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters. OpenFlamingo is an ongoing effort to produce an open-source replication of DeepMind's Flamingo models. On seven vision-language datasets, OpenFlamingo models average between 80% and 89% of the corresponding Flamingo performance. This technical report describes our models, training data, hyperparameters, and evaluation suite. We share our models and code at https://github.com/mlfoundations/open_flamingo.
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, Ludwig Schmidt · 2023
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 54.8 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 54.7 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 44 | 1043 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 54.8 | 664 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 21.8 | 418 |
| Multimodal Understanding | MMBench | Accuracy | 6.6 | 367 |
| Visual Question Answering | TextVQA (val) | VQA Score | 2.83e+3 | 309 |
| Visual Question Answering | OKVQA | Top-1 Accuracy | 37.8 | 283 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 24.8 | 281 |
| Video Understanding | MVBench | Accuracy | 7.9 | 247 |
Showing 10 of 90 rows