MiniVLM: A Smaller and Faster Vision-Language Model
About
Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer models and then fine-tuning on downstream VL tasks. While existing research has focused on achieving high accuracy with large pre-trained models, building a lightweight model is of great practical value but remains less explored. In this paper, we propose a smaller and faster VL model, MiniVLM, which can be fine-tuned with good performance on various downstream tasks like its larger counterpart. MiniVLM consists of two modules: a vision feature extractor and a transformer-based vision-language fusion module. We design a Two-stage Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet network, which reduces the time cost of visual feature extraction by 95% compared to a baseline model. After comparing different compact BERT models, we adopt the MiniLM structure to reduce the computation cost of the transformer module. In addition, we improve MiniVLM pre-training by adding 7M Open Images examples, which are pseudo-labeled by a state-of-the-art captioning model. We also pre-train with high-quality image tags obtained from a strong tagging model to enhance cross-modality alignment. These large models are used only offline and add no overhead to fine-tuning or inference. With the above design choices, MiniVLM reduces the model size by 73% and the inference time cost by 94% while retaining 94-97% of the accuracy on multiple VL tasks. We hope that MiniVLM eases the use of state-of-the-art VL research in on-the-edge applications.
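The two-module pipeline described above (a detector-style feature extractor feeding a compact fusion transformer) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `tee_extract` and `fuse`, the feature dimension, region count, and the placeholder mixing step are all hypothetical stand-ins for the real TEE detector and MiniLM-style transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

def tee_extract(image, num_regions=50, feat_dim=256):
    """Stand-in for the Two-stage Efficient feature Extractor (TEE).
    The real TEE is an EfficientDet-inspired detector that returns
    region features; this stub just emits arrays of the right shape."""
    return rng.standard_normal((num_regions, feat_dim))

def fuse(token_embeddings, region_features, num_layers=12):
    """Stand-in for the MiniLM-style fusion transformer: concatenates
    the text-token and region-feature sequences into one sequence, then
    applies a toy mean-pooling update in place of real self-attention."""
    seq = np.concatenate([token_embeddings, region_features], axis=0)
    for _ in range(num_layers):
        # Placeholder mixing step; a real layer would be multi-head
        # self-attention plus a feed-forward block.
        seq = seq + seq.mean(axis=0, keepdims=True)
    return seq

image = None                              # placeholder image input
tokens = rng.standard_normal((16, 256))   # 16 embedded text tokens
regions = tee_extract(image)              # 50 region features
fused = fuse(tokens, regions)
print(fused.shape)                        # (66, 256): 16 tokens + 50 regions
```

The key structural point the sketch captures is that the visual extractor runs once per image and its output is treated as extra "tokens" for the fusion transformer, so shrinking either module (TEE for vision, MiniLM for fusion) directly reduces end-to-end inference cost.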
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 1.198 | 682 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 69.4 | 466 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 69.1 | 337 |
| Visual Question Answering | VQA (test-std) | Accuracy | 68.1 | 110 |
| Image Captioning | COCO (Karpathy split) | CIDEr | 131.7 | 74 |
| Image Captioning | MS COCO (Karpathy) | CIDEr-D | 131.7 | 56 |
| Visual Reasoning | NLVR2 (test) | Accuracy | 73.93 | 44 |
| Image Captioning | COCO (test) | CIDEr | 115 | 43 |
| Image-to-Text Retrieval | MSCOCO (test) | R@5 | 85.1 | 33 |
| Text-to-Image Retrieval | MSCOCO (test) | R@5 | 74.1 | 25 |