PLAID: An Efficient Engine for Late Interaction Retrieval

About

Pre-trained language models are increasingly important components across multiple information retrieval (IR) paradigms. Late interaction, introduced with the ColBERT model and recently refined in ColBERTv2, is a popular paradigm that holds state-of-the-art status across many benchmarks. To dramatically reduce the search latency of late interaction, we introduce the Performance-optimized Late Interaction Driver (PLAID). Without impacting quality, PLAID swiftly eliminates low-scoring passages using a novel centroid interaction mechanism that treats every passage as a lightweight bag of centroids. PLAID uses centroid interaction as well as centroid pruning, a mechanism for sparsifying the bag of centroids, within a highly optimized engine to reduce late interaction search latency by up to 7× on a GPU and 45× on a CPU against vanilla ColBERTv2, while continuing to deliver state-of-the-art retrieval quality. This allows the PLAID engine with ColBERTv2 to achieve latency of tens of milliseconds on a GPU and tens to a few hundred milliseconds on a CPU, even at the largest scale we evaluate, with 140M passages.
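
The "bag of centroids" idea can be illustrated with a small NumPy sketch. It is not the PLAID engine's code. It assumes that each passage is represented only by the ids of the centroids its token embeddings were assigned to, that a passage's approximate score is the sum over query tokens of the best-matching centroid score, and that pruning drops centroids whose best score against any query token falls below a threshold. The function name `centroid_interaction`, the parameter `prune_threshold`, and the toy data are hypothetical and chosen for illustration.

```python
import numpy as np


def centroid_interaction(query, centroids, passage_centroid_ids, prune_threshold=None):
    """Approximate late-interaction scores from centroid ids only (illustrative sketch).

    query: (num_query_tokens, dim) array of query token embeddings.
    centroids: (num_centroids, dim) array of centroid vectors.
    passage_centroid_ids: one sequence of centroid ids per passage.
    prune_threshold: hypothetical cutoff; centroids whose best score over all
        query tokens is below it are ignored.
    """
    # scores[c, q] is the similarity between centroid c and query token q.
    scores = centroids @ query.T

    # Assumed pruning rule: keep a centroid only if it is relevant
    # to at least one query token.
    if prune_threshold is None:
        keep = np.ones(len(centroids), dtype=bool)
    else:
        keep = scores.max(axis=1) >= prune_threshold

    results = np.zeros(len(passage_centroid_ids))
    for i, ids in enumerate(passage_centroid_ids):
        ids = np.unique(np.asarray(ids, dtype=int))  # a bag: duplicates add nothing
        ids = ids[keep[ids]]                         # sparsify the bag
        if ids.size == 0:
            continue                                 # nothing left, score stays 0
        # For each query token take the best centroid, then sum over tokens.
        results[i] = scores[ids].max(axis=0).sum()
    return results


# Toy data: three orthogonal centroids, a two-token query, three passages.
centroids = np.eye(3)
query = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
bags = [[0, 1], [2], [0, 0, 2]]

approx_scores = centroid_interaction(query, centroids, bags, prune_threshold=0.5)
ranking = np.argsort(-approx_scores)  # best candidates first
```

In a full pipeline, only the top-ranked candidates from a step like this would go on to exact late-interaction scoring. That is how low-scoring passages can be discarded cheaply.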

Keshav Santhanam, Omar Khattab, Christopher Potts, Matei Zaharia • 2022

Related benchmarks

Task	Dataset	Result
Information Retrieval	BEIR (test)	--	126
Information Retrieval	MS-MARCO (test)	--	56
Zero-shot Information Retrieval	BEIR	NFCorpus NDCG@10 (Zero-shot) 33.8	38
End-to-end Retrieval	LoTTE	Latency (ms) 288	26
End-to-end Retrieval	MSMARCO	Latency (ms) 222	18
Semantic Relatedness	BEIR Semantic Relatedness Tasks (test)	ArguAna Score 42.06	16
Information Retrieval	LoTTE Search (test)	Lifestyle Score 84.3	9
Information Retrieval	LoTTE Forum (test)	IR Score (Lifestyle) 76.7	9
Information Retrieval	Quora	QPS 89	9
Information Retrieval	ArguAna	QPS 76	9

