
MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

About

We introduce MUSE-VL, a Unified Vision-Language Model built on Semantic Discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and understanding. However, existing vision tokenizers (e.g., VQGAN) capture only low-level information, which makes them difficult to align with language tokens. This results in high training complexity and requires a large amount of training data to achieve optimal performance; moreover, their performance still falls far short of dedicated understanding models. This paper proposes Semantic Discrete Encoding (SDE), which effectively aligns the information in visual tokens and language tokens by adding semantic constraints to the visual tokenizer. This greatly reduces the amount of training data needed and improves the performance of the unified model. With the same LLM size, our method improves understanding performance by 4.8% over the previous SOTA Emu3 and surpasses the dedicated understanding model LLaVA-NeXT 34B by 3.7%. Our model also surpasses existing unified models on visual generation benchmarks.
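The abstract describes adding a semantic constraint to a VQ-style visual tokenizer so that discrete visual tokens align with language-level semantics. The paper's exact formulation is not reproduced here; the following is a minimal PyTorch sketch under the assumption that the standard VQ codebook/commitment losses are augmented with a cosine-alignment term against features from a pretrained semantic encoder (e.g., a CLIP-like model, represented below by a placeholder tensor). All function names and the loss weighting are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def quantize(z, codebook):
    """Map continuous features z (N, D) to their nearest codebook entries (K, D)."""
    dists = torch.cdist(z, codebook)      # (N, K) pairwise L2 distances
    idx = dists.argmin(dim=1)             # nearest code index per feature
    return codebook[idx], idx

def sde_loss(z, codebook, sem_feats, beta=0.25, lam=1.0):
    """VQ losses plus a semantic-alignment constraint (hypothetical SDE form).

    z         : (N, D) encoder outputs
    codebook  : (K, D) learnable discrete codes
    sem_feats : (N, D) features from a frozen semantic encoder (assumption)
    """
    z_q, idx = quantize(z, codebook)
    # Standard VQ-VAE terms: move codes toward encoder outputs,
    # and commit the encoder to its chosen codes.
    codebook_loss = F.mse_loss(z_q, z.detach())
    commit_loss = F.mse_loss(z, z_q.detach())
    # Semantic constraint: push quantized tokens toward the
    # semantic encoder's features via cosine similarity.
    sem_loss = 1 - F.cosine_similarity(z_q, sem_feats, dim=-1).mean()
    loss = codebook_loss + beta * commit_loss + lam * sem_loss
    # Straight-through estimator so gradients reach the encoder.
    z_q_ste = z + (z_q - z).detach()
    return loss, z_q_ste, idx
```

In this sketch the semantic term operates on the quantized vectors, so the codebook itself is pulled toward semantically meaningful regions of feature space, which is the intuition the abstract gives for why SDE tokens align more easily with language tokens.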

Rongchang Xie, Chen Du, Ping Song, Chang Liu• 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | -- | 1455 |
| Multimodal Evaluation | MME | -- | 658 |
| Mathematical Reasoning | MathVista | Score: 55.9 | 385 |
| Multimodal Understanding | SEED-Bench | -- | 343 |
| Multi-discipline Multimodal Understanding | MMMU | -- | 317 |
| Text-to-Image Generation | GenEval | Overall Score: 57 | 218 |
| Visual Understanding | MM-Vet | MM-Vet Score: 55.9 | 142 |
| Vision Understanding | MMBench | -- | 141 |
| Text-to-Image Generation | GenEval | GenEval Score: 0.57 | 88 |
| Image Reconstruction | ImageNet-1k 256x256 (val) | rFID: 2.26 | 77 |

Showing 10 of 16 rows.
