PaLI-X: On Scaling up a Multilingual Vision and Language Model
About
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, in terms of both the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state of the art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 86.1 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 80.78 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 70.9 | 1043 |
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 1.492 | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 86 | 664 |
| Image Classification | ImageNet A | Top-1 Acc | 73.47 | 553 |
| Image Classification | ImageNet V2 | Top-1 Acc | 83.66 | 487 |
| Video Question Answering | MSRVTT-QA | Accuracy | 47.1 | 481 |
| Image Classification | ImageNet-R | Top-1 Acc | 82.96 | 474 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 86.1 | 466 |