Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

About

In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior approaches, such as Chameleon, often rely on a single visual encoder for both tasks. However, because multimodal understanding and generation require different levels of information granularity, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo • 2024
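
The decoupling described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (understanding_encoder, generation_encoder, unified_transformer, run), the feature dimension, and the codebook size are all hypothetical, and the "transformer" is replaced by a simple mean-pool so the example stays dependency-free. It only aims to show the routing idea: each task picks its own visual pathway, and both pathways feed one shared sequence model.

```python
from typing import Callable, Dict, List

Vector = List[float]
Image = List[List[float]]  # toy grayscale image, pixel values in [0, 1]


def understanding_encoder(image: Image, dim: int = 4) -> List[Vector]:
    """Hypothetical semantic pathway: one coarse feature vector per image row."""
    return [[sum(row) / len(row)] * dim for row in image]


def generation_encoder(image: Image, dim: int = 4, codebook_size: int = 8) -> List[Vector]:
    """Hypothetical fine-grained pathway: quantize each pixel to a discrete
    code, then map the code to an embedding vector."""
    codes = [
        min(int(pixel * codebook_size), codebook_size - 1)
        for row in image
        for pixel in row
    ]
    return [[code / codebook_size] * dim for code in codes]


def unified_transformer(sequence: List[Vector]) -> Vector:
    """Stand-in for the single shared backbone (assumption: mean-pooling
    replaces real attention layers to keep the sketch tiny)."""
    dim = len(sequence[0])
    return [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]


# Each task selects its own visual pathway; the backbone is shared.
ENCODERS: Dict[str, Callable[[Image], List[Vector]]] = {
    "understanding": understanding_encoder,
    "generation": generation_encoder,
}


def run(task: str, image: Image, text_embeddings: List[Vector]) -> Vector:
    visual_embeddings = ENCODERS[task](image)
    # Text and visual embeddings are concatenated into one sequence.
    return unified_transformer(text_embeddings + visual_embeddings)


if __name__ == "__main__":
    image = [[0.0, 1.0], [0.5, 0.5]]
    prompt = [[0.0, 0.0, 0.0, 0.0]]
    print(run("understanding", image, prompt))
    print(run("generation", image, prompt))
```

Because the two pathways only need to agree on the embedding dimension, either one could be swapped for a different encoder without touching the shared backbone, which is the flexibility the abstract points to.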

Related benchmarks

Task                              Dataset  Result             Rank
Visual Question Answering         GQA      Accuracy 59.1      963
Object Hallucination Evaluation   POPE     Accuracy 87        935
Multimodal Evaluation             MME      --                 557
Text-to-Image Generation          GenEval  Overall Score 81   467
Multimodal Understanding          MM-Vet   MM-Vet Score 60.1  418
Multimodal Understanding          MMBench  Accuracy 69.4      367
Multimodal Capability Evaluation  MM-Vet   Score 34.3         282
Multimodal Reasoning              MM-Vet   MM-Vet Score 34.3  281
Text-to-Image Generation          GenEval  GenEval Score 61   277
Multimodal Understanding          MMMU     Accuracy 30.5      275
Showing 10 of 60 rows
