
CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding

About

Although Multimodal Large Language Models (MLLMs) have shown remarkable potential in Visual Document Retrieval (VDR) through generating high-quality multi-vector embeddings, the substantial storage overhead caused by representing a page with thousands of visual tokens limits their practicality in real-world applications. To address this challenge, we propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings. By incorporating iterative margin loss during contrastive training, CausalEmbed encourages the embedding models to learn compact and well-structured representations. Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count while maintaining highly competitive performance across various backbones and benchmarks. Theoretical analysis and empirical results demonstrate the unique advantages of auto-regressive embedding generation in terms of training efficiency and scalability at test time. As a result, CausalEmbed introduces a flexible test-time scaling strategy for multi-vector VDR representations and sheds light on the generative paradigm within multimodal document retrieval. Our code is available at https://github.com/Z1zs/Causal-Embed.
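To make the auto-regressive idea concrete, here is a minimal NumPy sketch of generating a compact multi-vector embedding step by step, where each new latent vector is conditioned on the page's visual tokens plus all previously generated vectors. The attention stand-in, the sizes (`N_PAGE`, `K`, `D`), and the `bos` seed vector are illustrative assumptions, not the paper's actual architecture or trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64         # embedding dimension (assumed for illustration)
N_PAGE = 1024  # visual tokens produced by the MLLM page encoder (assumed)
K = 16         # compact multi-vector budget: dozens instead of thousands

# Stand-in for the encoder's visual token features for one document page.
page_tokens = rng.standard_normal((N_PAGE, D))

def attend(query, keys, values):
    """Single-query scaled dot-product attention (toy, single head)."""
    scores = keys @ query / np.sqrt(D)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

# Auto-regressive loop: vector t is generated conditioned on the page
# tokens and vectors 1..t-1 (a toy stand-in for the learned decoder).
bos = rng.standard_normal(D)  # hypothetical begin-of-sequence query
latents = []
for _ in range(K):
    context = np.vstack([page_tokens] + latents) if latents else page_tokens
    query = latents[-1] if latents else bos
    v = attend(query, context, context)
    v /= np.linalg.norm(v)  # unit-normalize for cosine-based retrieval
    latents.append(v)

multi_vector = np.stack(latents)  # (K, D): 1024 tokens -> 16 vectors
print(multi_vector.shape)
```

Because generation is sequential, a retriever can stop after fewer steps (or run more) at inference, which is one way to read the test-time scaling claim: the vector budget becomes a knob rather than a fixed cost.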

Jiahao Huo, Yu Huang, Yibo Yan, Ye Pan, Yi Cao, Mingdong Ou, Philip S. Yu, Xuming Hu • 2026

Related benchmarks

Task | Dataset | Result | Rank
Visual document retrieval | ViDoRe V3 | HR 42.9 | 23
Visual document retrieval | ViDoRe V2 | ESG Score 49.8 | 14
Document Retrieval | ViDoRe V1 | Arxiv Score 80.7 | 14
