
Contextual Document Embeddings

About

Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context, analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.
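The abstract describes the first method only at a high level. As a rough, NumPy-based illustration of the underlying mechanics: a standard biencoder is trained with an in-batch contrastive (InfoNCE) loss, and the contextual variant changes *which* documents share a batch, so that negatives come from neighboring documents rather than random ones. The function names `info_nce_loss` and `contextual_batches` and the greedy nearest-neighbor grouping below are illustrative assumptions, not the authors' implementation (which is only summarized in the abstract):

```python
import numpy as np

def info_nce_loss(query_embs, doc_embs, temperature=0.05):
    """Standard in-batch contrastive (InfoNCE) loss: row i of query_embs is
    paired with row i of doc_embs; all other rows act as negatives."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature              # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()             # positives on the diagonal

def contextual_batches(doc_embs, batch_size):
    """Hypothetical batch construction: greedily group nearest-neighbor
    documents so each batch approximates a retrieval 'context', which makes
    the in-batch negatives come from similar documents."""
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = d @ d.T
    unused = set(range(len(d)))
    batches = []
    while unused:
        seed = unused.pop()
        order = np.argsort(-sims[seed])  # seed's neighbors, most similar first
        batch = [seed] + [j for j in order if j in unused][: batch_size - 1]
        unused -= set(batch)
        batches.append(batch)
    return batches
```

Training would then draw each minibatch from `contextual_batches` instead of shuffling uniformly; the loss itself is unchanged, but its denominator now ranges over documents from the same neighborhood.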

John X. Morris, Alexander M. Rush • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long document retrieval | LongBench Retrieval v2 (full) | F1 | 0.0521 | 5 |
| Single-document retrieval | ConditionalQA | F1 | 15.47 | 4 |
| Single-document retrieval | Qasper | F1 | 9.87 | 4 |
| Single-document retrieval | QASA | F1 | 13.41 | 4 |
| Single-document retrieval | RepLiQA | F1 | 0.2598 | 4 |
| Single-document retrieval | NaturalQuestions | F1 | 24.16 | 4 |
| Single-document retrieval | RepLiQA | Latency (s) | 0.1305 | 1 |
| Single-document retrieval | ConditionalQA | Latency (s) | 0.3393 | 1 |
| Single-document retrieval | NaturalQuestions | Latency (s) | 0.2442 | 1 |
| Single-document retrieval | Qasper | Latency (s) | 0.3298 | 1 |

Showing 10 of 21 rows.
