
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

About

The remarkable success of the autoregressive paradigm has driven significant advances in Multimodal Large Language Models (MLLMs), with powerful models such as Show-o, Transfusion, and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take homologous data as input to curate homologous preference data for both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow aligns multimodal understanding and generation using this homologous preference data. Extensive experiments demonstrate that our approach outperforms prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: https://github.com/Gen-Verse/HermesFlow
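As a rough illustration of the kind of objective the abstract refers to, the sketch below computes the standard DPO loss for a single preference pair and then sums it over an understanding pair and a generation pair built from the same input. The abstract does not define Pair-DPO, so the summed form, the `PairLogProbs` container, the `paired_loss` name, and the `beta` value are all assumptions made for illustration, not the authors' implementation; the repository linked above is the place to check the actual objective.

```python
import math
from dataclasses import dataclass


@dataclass
class PairLogProbs:
    """Sequence log-probabilities for one preference pair (hypothetical container)."""
    policy_chosen: float    # log p_policy(chosen | input)
    policy_rejected: float  # log p_policy(rejected | input)
    ref_chosen: float       # log p_reference(chosen | input)
    ref_rejected: float     # log p_reference(rejected | input)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(p: PairLogProbs, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    margin = beta * (
        (p.policy_chosen - p.ref_chosen) - (p.policy_rejected - p.ref_rejected)
    )
    # -log(sigmoid(m)) equals softplus(-m)
    return softplus(-margin)


def paired_loss(understanding: PairLogProbs,
                generation: PairLogProbs,
                beta: float = 0.1) -> float:
    """ASSUMPTION: combine the two preference terms by simple addition."""
    return dpo_loss(understanding, beta) + dpo_loss(generation, beta)


if __name__ == "__main__":
    # Made-up log-probabilities, for illustration only.
    und = PairLogProbs(-1.0, -3.0, -2.0, -2.0)
    gen = PairLogProbs(-5.0, -4.0, -4.5, -4.5)
    print(paired_loss(und, gen))
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and each term equals log 2; the loss drops below that as the policy favors the chosen sample more strongly than the reference does.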

Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui • 2025

Related benchmarks

Task                      Dataset               Metric                  Result  Rank
Text-to-Image Generation  GenEval 2             Soft TIFA AM            63.5    17
Text-to-Image Generation  ConceptMix k=2        Concept Fraction Score  76.68   16
Text-to-Image Generation  ConceptMix k=4        Concept Fraction Score  67.75   16
Text-to-Image Generation  ConceptMix k = 2 1.0  Full Mark Score         45.6    16
Text-to-Image Generation  ConceptMix k = 7 1.0  Full Mark Score         4.03    16
Text-to-Image Generation  ConceptMix k=3        Concept Fraction Score  71.93   16
Text-to-Image Generation  ConceptMix k=5        Concept Fraction Score  67.05   16
Text-to-Image Generation  ConceptMix k=6        Concept Fraction Score  63.9    16
Text-to-Image Generation  ConceptMix k=7        Concept Fraction Score  63.48   16
Text-to-Image Generation  ConceptMix k = 1 1.0  Full Mark Score         69.93   16
Showing 10 of 15 rows
