WARP: An Efficient Engine for Multi-Vector Retrieval

About

Multi-vector retrieval methods such as ColBERT and its recent variant, the ConteXtualized Token Retriever (XTR), offer high accuracy but face efficiency challenges at scale. To address this, we present WARP, a retrieval engine that substantially improves the efficiency of retrievers trained with the XTR objective through three key innovations: (1) WARP$_\text{SELECT}$ for dynamic similarity imputation; (2) implicit decompression, which avoids costly vector reconstruction during retrieval; and (3) a two-stage reduction process for efficient score aggregation. Combined with highly optimized C++ kernels, our system reduces end-to-end latency by 41x compared to XTR's reference implementation and achieves a 3x speedup over the ColBERTv2/PLAID engine, while preserving retrieval quality.
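
The abstract names the score-aggregation step without spelling it out, so the following is only an illustrative Python sketch of how a two-stage reduction with imputed similarities could look, not WARP's actual C++ kernels. It assumes candidates arrive as (document, query token, score) triples, that the first stage keeps the best score per document and query token, and that the second stage sums over query tokens while filling gaps with a per-token fallback estimate. The function name `two_stage_reduce`, the input layout, and the fixed `missing_estimates` list (a stand-in for the dynamic imputation attributed to WARP$_\text{SELECT}$) are all hypothetical.

```python
from collections import defaultdict


def two_stage_reduce(candidates, missing_estimates):
    """Aggregate token-level scores into a ranked list of documents.

    candidates: iterable of (doc_id, query_token_index, score) triples
        (hypothetical layout chosen for this sketch).
    missing_estimates: one fallback score per query token, used when a
        document has no retrieved score for that token (assumed fixed here).
    """
    # Stage 1 (token level): keep the maximum score seen for each
    # (document, query token) pair.
    best = defaultdict(dict)
    for doc_id, q_idx, score in candidates:
        row = best[doc_id]
        if q_idx not in row or score > row[q_idx]:
            row[q_idx] = score

    # Stage 2 (document level): sum over all query tokens, imputing the
    # fallback estimate wherever a token has no observed score.
    num_query_tokens = len(missing_estimates)
    totals = {
        doc_id: sum(row.get(q, missing_estimates[q]) for q in range(num_query_tokens))
        for doc_id, row in best.items()
    }

    # Highest aggregate score first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy_candidates = [
        ("d1", 0, 0.75),
        ("d1", 0, 0.5),   # lower score for the same pair, dropped in stage 1
        ("d1", 1, 0.5),
        ("d2", 0, 0.5),   # d2 has no score for query token 1
    ]
    print(two_stage_reduce(toy_candidates, missing_estimates=[0.125, 0.25]))
```

In this toy input, document d2 has no retrieved score for the second query token, so its total uses the fallback estimate for that token instead of treating the missing similarity as zero.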

Jan Luca Scheerer, Matei Zaharia, Christopher Potts, Gustavo Alonso, Omar Khattab • 2025

Related benchmarks

Task	Dataset	Result
Retrieval	MS MARCO V1	Retrieval Latency (ms): 72.8	66
Information Retrieval	LoTTE pooled (test)	Retrieval Time (ms): 39	41
End-to-end Retrieval	LoTTE	Latency (ms): 49	26

