Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
About
Training models with longer in-context lengths is a significant challenge for multimodal models due to substantial GPU memory and computational costs. This exploratory study does not present state-of-the-art models; rather, it introduces an innovative method designed to efficiently increase in-context text length in multi-modality large language models (MLLMs). We present Visualized In-Context Text Processing (VisInContext), which processes long in-context text using visual tokens. This technique significantly reduces GPU memory usage and floating point operations (FLOPs) at both the training and inference stages. For instance, our method expands the pre-training in-context text length from 256 to 2048 tokens with nearly the same FLOPs for a 56-billion-parameter MoE model. Experimental results demonstrate that models trained with VisInContext deliver superior performance on common downstream benchmarks for in-context few-shot evaluation. Additionally, VisInContext is complementary to existing methods for increasing in-context text length and enhances document understanding capabilities, showing great potential in document QA tasks and sequential document retrieval.
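The core idea above — rendering long in-context text as an image and consuming it as a much shorter sequence of visual tokens — can be illustrated with a minimal sketch. This is not the paper's implementation: real font rendering is replaced by deterministic pseudo-glyph bitmaps, and the glyph and patch sizes (`GLYPH`, `PATCH`) are illustrative assumptions.

```python
import numpy as np

PATCH = 16          # ViT-style patch size (assumption, not from the paper)
GLYPH = 8           # pixel size of one rendered character cell (assumption)
CHARS_PER_ROW = 64  # characters rendered per line of the canvas

def render_text_as_image(text):
    """Stand-in for real text rendering: each character becomes a
    deterministic GLYPH x GLYPH bitmap, tiled left-to-right, top-to-bottom."""
    rows = -(-len(text) // CHARS_PER_ROW)  # ceiling division
    canvas = np.zeros((rows * GLYPH, CHARS_PER_ROW * GLYPH), dtype=np.float32)
    for i, ch in enumerate(text):
        r, c = divmod(i, CHARS_PER_ROW)
        rng = np.random.default_rng(ord(ch))  # per-character pseudo-glyph
        canvas[r * GLYPH:(r + 1) * GLYPH,
               c * GLYPH:(c + 1) * GLYPH] = rng.random((GLYPH, GLYPH))
    return canvas

def patchify(img):
    """Split the rendered image into non-overlapping PATCH x PATCH patches,
    each of which would be embedded as one visual token."""
    h = -(-img.shape[0] // PATCH) * PATCH  # pad up to a multiple of PATCH
    w = -(-img.shape[1] // PATCH) * PATCH
    padded = np.zeros((h, w), dtype=img.dtype)
    padded[:img.shape[0], :img.shape[1]] = img
    patches = padded.reshape(h // PATCH, PATCH, w // PATCH, PATCH)
    return patches.transpose(0, 2, 1, 3).reshape(-1, PATCH * PATCH)

text = "in-context example " * 100  # 1900 characters of context
tokens = patchify(render_text_as_image(text))
print(f"{len(text)} characters -> {tokens.shape[0]} visual tokens")
```

Because each 16x16 patch covers four 8x8 character cells, 1900 characters collapse into 480 visual tokens here; the compression ratio is what lets the context window grow at roughly constant FLOPs.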
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 51 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 31.2 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 41.2 | 1043 |
| Visual Question Answering | OK-VQA | Accuracy | 46.3 | 224 |
| Document Visual Question Answering | DocVQA (test) | ANLS | 52.2 | 192 |
| Image Captioning | Flickr30K | CIDEr | 68.4 | 111 |
| Image Captioning | MS-COCO | CIDEr | 101.3 | 61 |
| Visual Question Answering | DocVQA (val) | ANLS | 48.5 | 31 |
| Image Question Answering | OCR-VQA | Accuracy | 58.4 | 27 |
| Image Classification | Hateful Memes | Accuracy | 65.5 | 11 |