# Olympus: A Universal Task Router for Computer Vision Tasks

## About
We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates over 20 specialized tasks across images, videos, and 3D objects to dedicated modules. This instruction-based routing enables complex workflows through chained actions without the need for training heavy generative models. Olympus easily integrates with existing MLLMs, expanding their capabilities with comparable performance. Experimental results demonstrate that Olympus achieves an average routing accuracy of 94.75% across 20 tasks and precision of 91.82% in chained action scenarios, showcasing its effectiveness as a universal task router that can solve a diverse range of computer vision tasks. Project page: http://yuanze-lin.me/Olympus_page/
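The routing idea above can be sketched in a few lines: a controller emits routing tokens for an instruction, and a dispatcher forwards the request to the matching specialist modules in order, which supports chained actions. This is a hypothetical illustration, not Olympus's actual API; the token names, keyword-matching controller, and module stubs are all assumptions made for the sketch.

```python
from typing import Callable, Dict, List

# Registry of specialist modules keyed by routing token.
# These stubs stand in for dedicated image/video/3D models.
MODULES: Dict[str, Callable[[str], str]] = {
    "<image_gen>": lambda prompt: f"[image generated for: {prompt}]",
    "<video_gen>": lambda prompt: f"[video generated for: {prompt}]",
    "<3d_gen>": lambda prompt: f"[3D object generated for: {prompt}]",
}

def mock_controller(instruction: str) -> List[str]:
    """Stand-in for the controller MLLM: predicts routing tokens.

    A real controller would infer these tokens from the instruction;
    simple keyword matching is used here purely for illustration.
    """
    tokens: List[str] = []
    if "image" in instruction:
        tokens.append("<image_gen>")
    if "video" in instruction:
        tokens.append("<video_gen>")
    if "3d" in instruction.lower():
        tokens.append("<3d_gen>")
    return tokens

def route(instruction: str) -> List[str]:
    """Dispatch the instruction to each selected module, chaining
    actions in the order the routing tokens were emitted."""
    return [MODULES[tok](instruction) for tok in mock_controller(instruction)]

print(route("draw an image of a cat, then animate it as a video"))
```

A chained instruction like the one above yields two actions, one per emitted routing token, without invoking any heavy generative model inside the controller itself.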
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 53.4 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 48.2 | 1043 |
| Visual Question Answering | GQA | Accuracy | 63.9 | 963 |
| Object Hallucination Evaluation | POPE | -- | -- | 935 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 32.8 | 266 |
| Science Question Answering | ScienceQA IMG | Accuracy | 70.7 | 256 |
| Visual Question Answering | VQAv2 | Accuracy | 80.5 | 177 |
| Multimodal Benchmark | MMBench (MMB) | Accuracy | 71.2 | 70 |
| Multimodal Cognition | MME Cognition | Cognition Score | 283.2 | 34 |
| Perception Evaluation | MME Perception | Perception Score | 1520 | 15 |