Megrez-Omni Technical Report

About

In this work, we present the Megrez models, comprising a language model (Megrez-3B-Instruct) and a multimodal model (Megrez-3B-Omni). These models are designed to deliver fast inference, compactness, and robust edge-side intelligence through a software-hardware co-design approach. Megrez-3B-Instruct offers several advantages, including high accuracy, high speed, ease of use, and a wide range of applications. Building on Megrez-3B-Instruct, Megrez-3B-Omni is an on-device multimodal understanding LLM that supports image, text, and audio analysis. It achieves state-of-the-art accuracy across all three modalities and demonstrates strong versatility and robustness, setting a new benchmark for multimodal AI models.

Boxun Li, Yadong Li, Zhiyuan Li, Congyi Liu, Weilin Liu, Guowei Niu, Zheyue Tan, Haiyang Xu, Zhuyu Yao, Tao Yuan, Dong Zhou, Yueqing Zhuang, Shengen Yan, Guohao Dai, Yu Wang • 2025

Related benchmarks

Task	Dataset	Result
Audio Understanding	MMSU	Perception Score: 32.5	37
SlideASR	SlideSpeech (dev)	--	16
SlideASR	SlideSpeech (test)	--	16
Open QA (Speech to Text)	AgriBench-Omni 1.0 (test)	Accuracy (CN): 80	11
Multiple-Choice Reasoning	Agricultural Benchmark Speech + Image + Text 1.0 (test)	Accuracy (CN): 49	4

