Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation

About

We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Unlike existing multimodal MDMs such as MMaDa and Muddit, which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution (1024px) text-to-image synthesis. Lavida-O incorporates a novel Elastic Mixture-of-Transformers (Elastic-MoT) architecture that couples a lightweight generation branch with a larger understanding branch, supported by token compression, universal text conditioning, and stratified sampling for efficient and high-quality generation. Lavida-O further incorporates planning and iterative self-reflection in image generation and editing tasks, seamlessly boosting generation quality with its understanding capabilities. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive models and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable speedup at inference. These advances establish Lavida-O as a new paradigm for scalable multimodal reasoning and generation.

Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen • 2025

Related benchmarks

Task | Dataset | Result | Rank
Mathematical Reasoning | GSM8K | Accuracy 70.6 | 351
Referring Expression Comprehension | RefCOCO+ (val) | -- | 345
Referring Expression Comprehension | RefCOCO (val) | -- | 335
Referring Expression Comprehension | RefCOCO (testA) | -- | 333
Referring Expression Comprehension | RefCOCOg (val) | -- | 291
Referring Expression Comprehension | RefCOCOg (test) | -- | 291
Visual Question Answering | ChartQA | Accuracy 80.8 | 239
Referring Expression Comprehension | RefCOCO+ (testB) | -- | 235
Referring Expression Comprehension | RefCOCO+ (testA) | -- | 207
Referring Expression Comprehension | RefCOCO (testB) | -- | 196
Showing 10 of 28 rows
