Reconstruction Alignment Improves Unified Multimodal Models

About

Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RECA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts", providing rich supervision without captions. Concretely, RECA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RECA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU hours, post-training with RECA substantially improves image generation performance on GenEval (0.73 $\rightarrow$ 0.90) and DPGBench (80.93 $\rightarrow$ 88.15), while also boosting editing benchmarks (ImgEdit 3.38 $\rightarrow$ 3.75, GEdit 6.94 $\rightarrow$ 7.27). Notably, RECA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
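The reconstruction objective described above can be illustrated with a deliberately tiny, pure-Python sketch. This is not the authors' implementation. It makes several assumptions: the "understanding encoder" is a fixed orthogonal 4x4 matrix that stays frozen, the "generator" is a trainable linear map, images are 4-element vectors, and mean squared error stands in for the reconstruction loss. All names (`ENCODER`, `encode`, `generate`, `reca_step`, `post_train`) are hypothetical.

```python
# Toy sketch of a reconstruction-alignment objective (illustrative only).
# Assumptions: linear models, 4-pixel "images", MSE loss, frozen encoder.

# Frozen stand-in for the visual understanding encoder (scaled Hadamard matrix).
ENCODER = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def encode(image):
    """Map an image to its dense embedding; this replaces the text caption."""
    return matvec(ENCODER, image)


def generate(embedding, gen_weights):
    """Toy generator: reconstruct an image from the conditioning embedding."""
    return matvec(gen_weights, embedding)


def reconstruction_loss(recon, image):
    """Mean squared error between the reconstruction and the input image."""
    return sum((r - x) ** 2 for r, x in zip(recon, image)) / len(image)


def reca_step(image, gen_weights, lr):
    """One gradient step on the reconstruction loss; only the generator changes."""
    embedding = encode(image)                 # condition on the model's own embedding
    recon = generate(embedding, gen_weights)  # try to regenerate the input image
    loss = reconstruction_loss(recon, image)
    n = len(image)
    for i, row in enumerate(gen_weights):
        grad_out = 2.0 * (recon[i] - image[i]) / n  # d(loss) / d(recon[i])
        for j in range(len(row)):
            row[j] -= lr * grad_out * embedding[j]  # in-place weight update
    return loss


def post_train(images, epochs=50, lr=0.5):
    """Run the self-supervised loop; returns the weights and per-epoch mean loss."""
    dim = len(images[0])
    gen_weights = [[0.0] * dim for _ in range(dim)]
    history = []
    for _ in range(epochs):
        losses = [reca_step(img, gen_weights, lr) for img in images]
        history.append(sum(losses) / len(losses))
    return gen_weights, history
```

Calling `post_train` on a handful of 4-element images returns the learned generator weights and a loss history that should shrink as the generator learns to invert the encoder. No captions enter the loop: the only supervision is the input image itself. In an actual UMM, the generator would be the model's own generation pathway and the loss would be whatever reconstruction loss that architecture uses, which this sketch does not attempt to model.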

Ji Xie, Trevor Darrell, Luke Zettlemoyer, XuDong Wang • 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Visual Question Answering	GQA	Accuracy: 58.5	1445
Text-to-Image Generation	GenEval	Overall Score: 90	914
Multimodal Understanding	MMBench	--	887
Multimodal Understanding	MM-Vet	MM-Vet Score: 66.1	664
Text-to-Image Generation	GenEval	Overall Score: 85.2	581
Text-to-Image Generation	GenEval	Overall Score (GenEval): 0.9	153
Text-to-Image Generation	DPGBench	DPGBench Score: 88.15	133
Visual Perception	MMVP	--	118
Multimodal Understanding	MMMU	MMMU Score: 52.3	110

Showing 10 of 37 rows
