
Otter: A Multi-Modal Model with In-Context Instruction Tuning

About

Recent advances in Large Multimodal Models (LMMs) have unveiled great potential as visual assistants. However, most existing work focuses on responding to individual instructions or on using previous dialogues for contextual understanding; little attention has been paid to employing both images and text as in-context examples to enhance instruction-following capability. To bridge this gap, we introduce the Otter model, which leverages both textual and visual in-context examples for instruction tuning. Specifically, Otter builds upon Flamingo with its Perceiver architecture and has been instruction-tuned as a general-purpose multi-modal assistant. Otter seamlessly processes multi-modal inputs, supporting modalities including text, multiple images, and dynamic video content. To support the training of Otter, we present the MIMIC-IT (MultI-Modal In-Context Instruction Tuning) dataset, which encompasses over 3 million multi-modal instruction-response pairs, including approximately 2.2 million unique instructions across a broad spectrum of images and videos. MIMIC-IT has been carefully curated to feature a diverse array of in-context examples for each entry. Comprehensive evaluations suggest that instruction tuning with these in-context examples substantially enhances model convergence and generalization. Notably, the extensive scenario coverage provided by MIMIC-IT empowers Otter to excel in tasks involving complex video and multi-image understanding.
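The key idea above is that each training entry pairs a query instruction with a few in-context examples, each grounded in one or more images. As a rough illustration, the sketch below shows how such an entry might be represented and flattened into a single training prompt. The field names, the `<image>` placeholder token, and the `User:`/`GPT:` turn format are illustrative assumptions, not the actual MIMIC-IT schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical layout of one in-context instruction-tuning entry,
# in the spirit of MIMIC-IT (field names are assumptions).

@dataclass
class Example:
    image_ids: List[str]   # references to one or more images or video frames
    instruction: str       # user instruction about the visual input
    response: str          # target response

@dataclass
class MimicItEntry:
    in_context: List[Example]  # few-shot examples shown before the query
    query: Example             # the instruction-response pair being trained on

def to_prompt(entry: MimicItEntry) -> str:
    """Flatten an entry into one training string; each <image> token
    stands in for visual features the model injects separately."""
    parts = []
    for ex in entry.in_context + [entry.query]:
        imgs = "<image>" * len(ex.image_ids)
        parts.append(f"{imgs}User: {ex.instruction} GPT: {ex.response}")
    return " ".join(parts)

entry = MimicItEntry(
    in_context=[Example(["img_001"], "What animal is shown?", "An otter.")],
    query=Example(["img_002"], "What is the animal doing?", "Swimming on its back."),
)
print(to_prompt(entry))
```

Because the in-context examples are serialized before the query, the model sees image-grounded demonstrations of the task at training time, which is what distinguishes this setup from tuning on isolated instruction-response pairs.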

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Joshua Adrian Cahyono, Jingkang Yang, Ziwei Liu • 2023

Related benchmarks

Task                              Dataset        Metric        Result  Rank
Visual Question Answering         TextVQA        Accuracy      21.2    1117
Visual Question Answering         VizWiz         Accuracy      50.0    1043
Visual Question Answering         GQA            Accuracy      38.1    963
Object Hallucination Evaluation   POPE           Accuracy      72.5    935
Multimodal Evaluation             MME            Score         1600    557
Multimodal Understanding          MM-Vet         MM-Vet Score  24.6    418
Multimodal Understanding          MMBench        Accuracy      51.4    367
Mathematical Reasoning            MathVista      Score         19.7    322
Multimodal Capability Evaluation  MM-Vet         Score         24.7    282
Science Question Answering        ScienceQA IMG  Accuracy      27.2    256
Showing 10 of 61 rows
