Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device
About
Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy to deploy on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to mobile devices. Its core module, the Mobile Conditioning Projector (MCP), conditions a diffusion generator on fused vision-language features using depthwise-separable convolutions and layerwise alignment, enabling efficient cross-modal conditioning at minimal computational cost. Trained on only a few million samples and post-trained on a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly improves both visual understanding and generation. Despite its efficiency, Mobile-O attains competitive or superior performance relative to other unified models: it achieves 74% on GenEval, outperforming Show-O and JanusFlow by 5% and 11% while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% on average across seven benchmarks. Generating a 512x512 image in only ~3s on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will facilitate future research on real-time unified multimodal intelligence that runs entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
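To make the MCP idea concrete, below is a minimal PyTorch sketch of a conditioning projector built from depthwise-separable convolutions with one head per conditioned diffusion layer (a simple reading of "layerwise alignment"). All module names, dimensions, the token grid size, and the number of aligned layers are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
from typing import List


class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv followed by a pointwise (1x1) conv."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


class MobileConditioningProjector(nn.Module):
    """Hypothetical MCP-style module: projects fused vision-language token
    features into per-layer conditioning maps for a diffusion generator.
    Dimensions and layer count are assumptions for illustration only."""

    def __init__(self, vlm_dim: int = 768, cond_dim: int = 320,
                 num_diffusion_layers: int = 4, grid: int = 16):
        super().__init__()
        self.grid = grid
        self.input_proj = nn.Linear(vlm_dim, cond_dim)
        # One lightweight depthwise-separable block per conditioned layer,
        # so each diffusion layer receives its own aligned conditioning map.
        self.layer_heads = nn.ModuleList([
            nn.Sequential(
                DepthwiseSeparableConv(cond_dim, cond_dim),
                nn.GELU(),
                DepthwiseSeparableConv(cond_dim, cond_dim),
            )
            for _ in range(num_diffusion_layers)
        ])

    def forward(self, vlm_tokens: torch.Tensor) -> List[torch.Tensor]:
        # vlm_tokens: (batch, grid*grid, vlm_dim) fused vision-language features.
        b = vlm_tokens.shape[0]
        x = self.input_proj(vlm_tokens)                       # (b, n, cond_dim)
        x = x.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        return [head(x) for head in self.layer_heads]


if __name__ == "__main__":
    mcp = MobileConditioningProjector()
    feats = torch.randn(1, 16 * 16, 768)   # dummy fused VLM features
    conds = mcp(feats)
    print([tuple(c.shape) for c in conds])  # 4 maps of (1, 320, 16, 16)
```

Depthwise-separable convolutions keep the parameter and FLOP cost of each head low, which is why this style of block is a natural fit for an on-device projector.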
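The quadruplet post-training format can likewise be illustrated with a minimal Python record; the field names and the pairing interpretation below are hypothetical, chosen only to mirror the (generation prompt, image, question, answer) structure described above.

```python
# Hypothetical quadruplet record for joint post-training; field names are
# illustrative, mirroring (generation prompt, image, question, answer).
quadruplet = {
    "generation_prompt": "A red bicycle leaning against a brick wall",
    "image": "samples/red_bicycle.png",   # image paired with the prompt
    "question": "What color is the bicycle?",
    "answer": "Red",
}
# One plausible reading: the prompt-image pair supervises generation while
# the question-answer pair about the same image supervises understanding,
# so a single sample trains both capabilities jointly.
```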
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | GQA | Accuracy | 62.9 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 86.4 | 935 |
| Text-to-Image Generation | GenEval | Overall Score | 74 | 467 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 38.1 | 418 |
| Visual Question Answering | ChartQA | Accuracy | 75.2 | 239 |
| Multimodal Understanding | SEED | Accuracy | 69.4 | 136 |
| Multimodal Understanding | MMMU | MMMU Score | 34.6 | 78 |
| Visual Question Answering | TextVQA | Accuracy | 67.8 | 69 |
| Multimodal Understanding | Multiple Datasets Aggregate | Average Score | 62.1 | 6 |
| Image Generation | MacBook Performance Benchmark (M2 Pro) | Latency (s) | 4 | 4 |