OmniFusion Technical Report

About

Last year, multimodal architectures brought about a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose OmniFusion, a model based on a pretrained LLM and adapters for the visual modality. We evaluated and compared several architecture design principles for better coupling of text and visual data: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternViT, etc.) and approaches to fusing them, the image encoding method (whole-image or tile-based encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on 8 visual-language benchmarks (VizWiz, POPE, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU) show that the best OmniFusion setup achieves the top score on different VQA tasks in comparison with open-source LLaVA-like solutions. We also present a variety of situations in which OmniFusion provides highly detailed answers in different domains: housekeeping, sightseeing, culture, medicine, recognition of handwritten and scanned equations, etc. The Mistral-based OmniFusion model is an open-source solution, with weights and training and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.
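To make the adapter idea in the abstract concrete, here is a minimal sketch of how an MLP adapter could project visual encoder features into an LLM's embedding space. It is not the OmniFusion implementation: the class `MLPAdapter`, the helper `build_llm_input`, the toy dimensions, the random weights, and the choice to prepend visual tokens to the text embeddings are all assumptions made for illustration, written in dependency-free Python.

```python
import math
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (plain lists, no dependencies)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def gelu(x):
    """Tanh approximation of GELU, applied elementwise."""
    c = math.sqrt(2.0 / math.pi)
    return [0.5 * v * (1.0 + math.tanh(c * (v + 0.044715 * v ** 3))) for v in x]

def init_linear(in_dim, out_dim, rng):
    """Small random weights; a real adapter would be trained."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

class MLPAdapter:
    """Hypothetical two-layer MLP mapping visual features to the LLM embedding size."""

    def __init__(self, vision_dim, hidden_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(vision_dim, hidden_dim, rng)
        self.w2, self.b2 = init_linear(hidden_dim, llm_dim, rng)

    def __call__(self, patch_features):
        # Each patch feature becomes one "visual token" in the LLM embedding space.
        return [
            linear(gelu(linear(p, self.w1, self.b1)), self.w2, self.b2)
            for p in patch_features
        ]

def build_llm_input(visual_tokens, text_embeddings):
    """Prepend projected visual tokens to the text embeddings (an assumed layout)."""
    return visual_tokens + text_embeddings

# Toy dimensions, chosen only for illustration.
VISION_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 4
adapter = MLPAdapter(VISION_DIM, HIDDEN_DIM, LLM_DIM)
patches = [[0.1 * (i + j) for j in range(VISION_DIM)] for i in range(3)]  # 3 fake patch features
text = [[0.0] * LLM_DIM for _ in range(5)]                                # 5 fake text embeddings
sequence = build_llm_input(adapter(patches), text)
print(len(sequence), len(sequence[0]))  # 8 4
```

The point of the sketch is only the shape contract: whatever the visual encoder emits, the adapter's output width has to match the LLM embedding width so that visual and text tokens can share one input sequence.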

Elizaveta Goncharova, Anton Razzhigaev, Matvey Mikhalchuk, Maxim Kurkin, Irina Abdullaeva, Matvey Skripkin, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy: 80.94 | 1165 |
| Visual Question Answering | GQA | Accuracy: 65.72 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy: 87.21 | 935 |
| Multimodal Understanding | MM-Vet | MM-Vet Score: 39.4 | 418 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score: 39.4 | 281 |
| Multimodal Understanding | MMMU | Accuracy: 36.9 | 275 |
| Multi-discipline Multimodal Understanding | MMMU | -- | 266 |
| Science Question Answering | ScienceQA SQA-IMG | Accuracy: 69.2 | 114 |
| Visual Question Answering | SciQA-IMG | Accuracy: 71.29 | 53 |
| Multimodal Understanding | MMB | Score: 69 | 30 |
Showing 10 of 11 rows
