
Very Efficient Listwise Multimodal Reranking for Long Documents

About

Listwise reranking is a key yet computationally expensive component in vision-centric retrieval and multimodal retrieval-augmented generation (M-RAG) over long documents. While recent VLM-based rerankers achieve strong accuracy, their practicality is often limited by long visual-token sequences and multi-step autoregressive decoding. We propose ZipRerank, a highly efficient listwise multimodal reranker that directly addresses both bottlenecks. It reduces input length via a lightweight query-image early interaction mechanism and eliminates autoregressive decoding by scoring all candidates in a single forward pass. To enable effective learning, ZipRerank adopts a two-stage training strategy: (i) listwise pretraining on large-scale text data rendered as images, and (ii) multimodal finetuning with VLM-teacher-distilled soft-ranking supervision. Extensive experiments on the MMDocIR benchmark show that ZipRerank matches or surpasses state-of-the-art multimodal rerankers while reducing LLM inference latency by up to an order of magnitude, making it well-suited for latency-sensitive real-world systems. The code is available at https://github.com/dukesun99/ZipRerank.
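To make the single-pass scoring and soft-ranking distillation ideas concrete, here is a minimal sketch. It is not taken from the ZipRerank code. It assumes the student reranker emits one scalar score per candidate in a single forward pass, and that the teacher's soft ranking is matched with a softmax KL divergence (a common ListNet-style choice; the abstract does not specify the actual loss). The function names `softmax`, `listwise_distill_loss`, and `rerank` and the `temperature` parameter are hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    # Numerically stable softmax over one candidate list.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_distill_loss(student_scores, teacher_scores, temperature=1.0):
    # Assumed loss: KL(teacher || student) between the two softmax
    # distributions over the same candidate list.
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rerank(candidate_ids, scores):
    # All scores are available at once, so ranking is a single sort
    # rather than step-by-step autoregressive decoding.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [candidate_ids[i] for i in order]

if __name__ == "__main__":
    pages = ["page_3", "page_7", "page_1"]   # hypothetical candidate pages
    student = [0.2, 1.5, 0.9]                # hypothetical student scores
    teacher = [0.1, 2.0, 1.0]                # hypothetical teacher scores
    print(rerank(pages, student))
    print(listwise_distill_loss(student, teacher))
```

Under these assumptions the loss is zero when student and teacher induce the same distribution over candidates and grows as their rankings diverge.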

Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Page-level reranking | MMDocIR 1.0 (test) | Recall (Resolution) | 92.6 | 24
Multimodal Reranking | ViDoRe English | NDCG@5 | 59.9 | 16
Multimodal Page-level Reranking | MMDocIR v1 (test) | Recall@1 | 63.3 | 10
Multimodal Reranking | MMDocIR (test) | Vision Latency (ms) | 180.2 | 5