Emu3.5: Native Multimodal Models are World Learners

About

We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score: 86 | 506 |
| Optical Character Recognition | OCRBench | Score: 68.7 | 232 |
| Multimodal Understanding | SEED | Accuracy: 68.2 | 183 |
| Text-to-Image Generation | DPG | Overall Score: 88.26 | 172 |
| Multimodal Reasoning | MMMU | Accuracy: 31.6 | 130 |
| Image Editing | GEdit-Bench | Semantic Consistency: 8.11 | 92 |
| Instruction-based Image Editing | ImgEdit Bench 1.0 (test) | Add Score: 4.61 | 37 |
| Combined | Multilingual Benchmark | IA Score: 5.76 | 34 |
| Lineart Colorization | Lineart Colorization 900 samples (test) | Image Alignment Score: 81.44 | 23 |
| Add | Multilingual Benchmark | IA: 5.69 | 17 |
Showing 10 of 25 rows
