MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality
About
Unified visual tokenization faces a fundamental trade-off between high-fidelity pixel reconstruction (spatial equivariance) and semantic abstraction (conceptual invariance). We attribute this conflict to Manifold Misalignment: naive joint optimization induces opposing gradients, creating a zero-sum game between reconstruction and perception. To address this, we propose MUSE, a framework based on Topological Orthogonality. By treating Structure as an orthogonal bridge, MUSE decouples optimization within Transformers: structural gradients refine attention topology, while semantic gradients update feature values. This turns destructive interference into Mutual Reinforcement. Experiments show that MUSE breaks the trade-off, achieving state-of-the-art generation quality (gFID 3.08) and surpassing its teacher InternViT-300M in linear probing (85.2% vs. 82.5%), demonstrating that structurally aligned reconstruction can enhance semantic perception. Code is available at https://github.com/PanqiYang1/MUSE.
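The abstract does not spell out how the gradient decoupling is implemented, so the following is only a toy sketch under stated assumptions, not the MUSE implementation. It assumes that "attention topology" corresponds to the query/key projections of a single attention head, that "feature values" correspond to the value projection, and that the semantic loss sees the attention map as a constant (a stand-in for a stop-gradient). All names (`w_q`, `w_k`, `w_v`, `structural_loss`, `semantic_loss`) and both loss definitions are hypothetical. Gradients are estimated by finite differences to keep the example dependency-light.

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_map(x, w_q, w_k):
    # The token-to-token attention map depends on W_q and W_k only.
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def structural_loss(params, x, target_map):
    # Hypothetical structural objective: match a target attention map.
    a = attention_map(x, params["w_q"], params["w_k"])
    return 0.5 * np.sum((a - target_map) ** 2)

def semantic_loss(params, x, target_feat, frozen_map):
    # Hypothetical semantic objective: the attention map is passed in as a
    # constant, so this loss can only influence the value projection W_v.
    out = frozen_map @ (x @ params["w_v"])
    return 0.5 * np.sum((out - target_feat) ** 2)

def numerical_grad(loss_fn, params, name, eps=1e-5):
    # Central finite-difference gradient of loss_fn w.r.t. params[name].
    w = params[name]
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        up = loss_fn(params)
        w[idx] = old - eps
        down = loss_fn(params)
        w[idx] = old  # restore the original entry
        grad[idx] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n_tokens, dim = 4, 3
x = rng.normal(size=(n_tokens, dim))
params = {n: 0.5 * rng.normal(size=(dim, dim)) for n in ("w_q", "w_k", "w_v")}
target_map = softmax(rng.normal(size=(n_tokens, n_tokens)))
target_feat = rng.normal(size=(n_tokens, dim))
# Computed once and held fixed: the stop-gradient stand-in.
frozen_map = attention_map(x, params["w_q"], params["w_k"])

struct = lambda p: structural_loss(p, x, target_map)
sem = lambda p: semantic_loss(p, x, target_feat, frozen_map)

grads_struct = {n: numerical_grad(struct, params, n) for n in params}
grads_sem = {n: numerical_grad(sem, params, n) for n in params}

for n in params:
    print(n,
          "structural grad nonzero:", bool(np.any(grads_struct[n] != 0)),
          "| semantic grad nonzero:", bool(np.any(grads_sem[n] != 0)))
```

Under these assumptions the two losses update disjoint parameter groups, so their gradients cannot directly oppose each other. Whether MUSE routes gradients this way, or uses a different mechanism, should be checked against the released code.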
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMBench | -- | -- | 847 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 62.9 | 631 |
| Multimodal Understanding | SEED-Bench | Accuracy | 75.5 | 516 |
| Text-to-Image Generation | GenEval | GenEval Score | 88 | 442 |
| Diagram Understanding | AI2D | Accuracy | 80.2 | 317 |
| Multimodal Understanding | MMMU | MMMU Score | 49.8 | 232 |
| Visual Perception | MMVP | Accuracy | 74.8 | 118 |
| Multimodal Perception | MME Perception | Perception Score | 1.65e+3 | 99 |
| Image Editing | ImgEdit | Add Score | 4.35 | 81 |
| Text-to-Image Generation | WISE | WISE Score | 0.65 | 67 |