
Otter: A Multi-Modal Model with In-Context Instruction Tuning

About

Recent advances in Large Multimodal Models (LMMs) have unveiled great potential as visual assistants. However, most existing work focuses on responding to individual instructions or on using previous dialogues for contextual understanding; little attention has been paid to employing both images and text as in-context examples to enhance instruction-following capability. To bridge this gap, we introduce the Otter model, which leverages both textual and visual in-context examples for instruction tuning. Specifically, Otter builds upon Flamingo with its Perceiver architecture and has been instruction-tuned as a general-purpose multi-modal assistant. Otter seamlessly processes multi-modal inputs, supporting modalities including text, multiple images, and dynamic video content. To support the training of Otter, we present the MIMIC-IT (MultI-Modal In-Context Instruction Tuning) dataset, which encompasses over 3 million multi-modal instruction-response pairs, including approximately 2.2 million unique instructions across a broad spectrum of images and videos. MIMIC-IT has been carefully curated to feature a diverse array of in-context examples for each entry. Comprehensive evaluations suggest that instruction tuning with these in-context examples substantially enhances model convergence and generalization. Notably, the extensive scenario coverage provided by MIMIC-IT empowers Otter to excel in tasks involving complex video and multi-image understanding.
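The key idea above is that each training entry pairs a query instruction with a few in-context examples, each grounded in one or more images. As a rough illustration, the sketch below shows how such an entry might be represented and flattened into a single training prompt. The field names, the `<image>` placeholder token, and the `User:`/`GPT:` turn format are illustrative assumptions, not the actual MIMIC-IT schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical layout of one in-context instruction-tuning entry,
# in the spirit of MIMIC-IT (field names are assumptions).

@dataclass
class Example:
    image_ids: List[str]   # references to one or more images or video frames
    instruction: str       # user instruction about the visual input
    response: str          # target response

@dataclass
class MimicItEntry:
    in_context: List[Example]  # few-shot examples shown before the query
    query: Example             # the instruction-response pair being trained on

def to_prompt(entry: MimicItEntry) -> str:
    """Flatten an entry into one training string; each <image> token
    stands in for visual features the model injects separately."""
    parts = []
    for ex in entry.in_context + [entry.query]:
        imgs = "<image>" * len(ex.image_ids)
        parts.append(f"{imgs}User: {ex.instruction} GPT: {ex.response}")
    return " ".join(parts)

entry = MimicItEntry(
    in_context=[Example(["img_001"], "What animal is shown?", "An otter.")],
    query=Example(["img_002"], "What is the animal doing?", "Swimming on its back."),
)
print(to_prompt(entry))
```

Because the in-context examples are serialized before the query, the model sees image-grounded demonstrations of the task at training time, which is what distinguishes this setup from tuning on isolated instruction-response pairs.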

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Joshua Adrian Cahyono, Jingkang Yang, Ziwei Liu • 2023

Related benchmarks

Task                              Dataset        Metric        Result  Rank
Visual Question Answering         TextVQA        Accuracy      21.2    1117
Visual Question Answering         VizWiz         Accuracy      50.0    1043
Visual Question Answering         GQA            Accuracy      38.1    963
Object Hallucination Evaluation   POPE           Accuracy      72.5    935
Multimodal Evaluation             MME            Score         1600    557
Multimodal Understanding          MM-Vet         MM-Vet Score  24.6    418
Multimodal Understanding          MMBench        Accuracy      51.4    367
Mathematical Reasoning            MathVista      Score         19.7    322
Multimodal Capability Evaluation  MM-Vet         Score         24.7    282
Science Question Answering        ScienceQA IMG  Accuracy      27.2    256
Showing 10 of 61 rows
