
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

About

Modern Large Vision-Language Models (LVLMs) share the same vision vocabulary -- CLIP -- which can cover most common vision tasks. However, for special vision tasks that need dense and fine-grained visual perception, e.g., document-level OCR or chart understanding, especially in non-English scenarios, the CLIP-style vocabulary may be inefficient at tokenizing the visual knowledge and may even suffer from the out-of-vocabulary problem. Accordingly, we propose Vary, an efficient and effective method to scale up the vision vocabulary of LVLMs. The procedure of Vary naturally divides into two stages: the generation and the integration of a new vision vocabulary. In the first stage, we devise a vocabulary network along with a tiny decoder-only transformer to produce the desired vocabulary via autoregression. In the second, we scale up the vanilla vision vocabulary by merging the new one with the original one (CLIP), enabling LVLMs to quickly acquire new features. Compared to the popular BLIP-2, MiniGPT4, and LLaVA, Vary maintains its vanilla capabilities while offering superior fine-grained perception and understanding ability. Specifically, Vary is competent in new document parsing features (OCR or markdown conversion) while achieving 78.2% ANLS on DocVQA and 36.2 on MMVet. Our code will be publicly available on the homepage.
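The merging step described above can be sketched in code. This is a minimal, hypothetical illustration, not the authors' implementation: the function names, token counts, and embedding dimension are assumptions, and the two encoders are stubbed out with random features. The point it shows is the structural idea of concatenating the token sequences from the original (CLIP) vocabulary and the new vocabulary network before they enter the language model.

```python
import numpy as np

# Illustrative stand-ins for the two vision encoders. In Vary, the new
# vocabulary network is trained first (stage one); here both are mocked
# with fixed random features so the sketch is self-contained.

def clip_encode(image, num_tokens=256, dim=1024):
    # Frozen original vision vocabulary (CLIP-style encoder), assumed
    # to emit a fixed-length sequence of token embeddings.
    rng = np.random.default_rng(0)
    return rng.standard_normal((num_tokens, dim))

def new_vocab_encode(image, num_tokens=256, dim=1024):
    # Newly generated vision vocabulary network (stage one of Vary),
    # also assumed to emit a token-embedding sequence of the same shape.
    rng = np.random.default_rng(1)
    return rng.standard_normal((num_tokens, dim))

def merge_vision_tokens(image):
    # Stage two: scale up the vanilla vocabulary by concatenating the
    # token sequences from both encoders along the sequence axis; the
    # combined sequence is what the LLM would consume downstream.
    old_tokens = clip_encode(image)
    new_tokens = new_vocab_encode(image)
    return np.concatenate([old_tokens, new_tokens], axis=0)

tokens = merge_vision_tokens(image=None)
print(tokens.shape)  # (512, 1024)
```

Concatenation along the token (sequence) axis, rather than the feature axis, keeps each vocabulary's embedding space intact and simply gives the language model more visual tokens to attend over.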

Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, Xiangyu Zhang • 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Question Answering | ChartQA | Accuracy | 65.3 | 239
Chart Question Answering | ChartQA | Accuracy | 66.1 | 229
Document Visual Question Answering | DocVQA (test) | ANLS | 76.3 | 192
Document Visual Question Answering | DocVQA | ANLS | 76.3 | 164
Multimodal Understanding | MM-VET (test) | Total Score | 36.2 | 114
Visual Question Answering | DocVQA | Accuracy | 76.3 | 103
Visual Question Answering | DocVQA (val) | ANLS | 78.2 | 31
Text Recognition | SROIE Task 2 (test) | F1 Score | 9.84 | 19
Document Image Retrieval | NL-DIR (test) | Recall@1 | 0.01 | 15
Document-level OCR | CORD 100 images (test) | F1 Score | 12.89 | 5
(10 of 18 benchmark rows shown.)
