Diffusion-Pretrained Dense and Contextual Embeddings
About
In this report, we introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling and a late chunking strategy to better preserve global context across long documents. We release two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark. Beyond public benchmarks, pplx-embed-v1 demonstrates strong performance on our internal evaluation suite, focusing on real-world, large-scale search scenarios constructed from 1B production web pages. These results validate the models' effectiveness in production environments where retrieval quality and efficiency are critical at scale.
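The combination of mean pooling and late chunking mentioned above can be illustrated with a minimal sketch. It assumes the usual late chunking recipe: encode the full document once so every token embedding already reflects document-wide context, then mean-pool the token embeddings inside each chunk's span. The helper names `mean_pool` and `late_chunk`, the toy token vectors, and the chunk spans are all hypothetical stand-ins, not the report's implementation or any pplx-embed API.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(token_embeddings: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty sequence of equal-length token vectors."""
    if not token_embeddings:
        raise ValueError("cannot pool an empty span")
    dim = len(token_embeddings[0])
    count = len(token_embeddings)
    return [sum(tok[d] for tok in token_embeddings) / count for d in range(dim)]


def late_chunk(
    token_embeddings: Sequence[Sequence[float]],
    chunk_spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Pool each [start, end) token span of an already-encoded document.

    The document is encoded once, before chunking, so each pooled chunk
    vector is built from tokens that attended to the whole document.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in chunk_spans]


# Toy stand-in for encoder output (assumption): 4 tokens, 2 dimensions.
doc_tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]

# Two hypothetical chunks covering tokens 0-1 and 2-3.
chunk_vectors = late_chunk(doc_tokens, [(0, 2), (2, 4)])
print(chunk_vectors)
```

The contrast is with chunking first and encoding each chunk in isolation, where a chunk's tokens never see the rest of the document.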
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Question Answering | ASQA | -- | -- | 27 |
| Chunk-level retrieval | ConTEB | Avg nDCG@10 | 81.96 | 13 |
| Multilingual Retrieval | MTEB Multilingual v2 | -- | -- | 11 |
| Tool Retrieval | ToolRet | Web nDCG@10 | 42.07 | 10 |
| Code Retrieval | MTEB Code | -- | -- | 10 |
| Information Retrieval | MIRACL RetrievalHardNegatives | Average Performance | 68.6 | 9 |
| Query-to-Document Retrieval | PPLXQuery2Doc Multilingual Small 7.5M | R@10 | 21.05 | 7 |
| Query-to-Document Retrieval | PPLXQuery2Doc Multilingual Medium 15M | R@10 | 17.87 | 7 |
| Query-to-Document Retrieval | PPLXQuery2Doc Multilingual Large 30M | R@10 | 15.58 | 7 |
| Query-to-Document Retrieval | PPLXQuery2Doc English Small 7.5M corpus 1.0 | Recall@10 | 16.29 | 7 |
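For readers unfamiliar with the metrics in the table, the following sketch shows how Recall@10 and nDCG@10 are commonly computed for a single query. It assumes binary relevance labels and unique document IDs in the ranking; the individual benchmarks may use graded relevance or different averaging, and the function names and example IDs are hypothetical. The functions return values in [0, 1], whereas the scores in the table appear to be reported on a 0-100 scale.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Binary-relevance nDCG: discounted gain divided by the ideal gain."""
    # Rank 0 gets discount log2(2) = 1, rank 1 gets log2(3), and so on.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query: two relevant documents, one retrieved at rank 2.
ranked = ["d3", "d1", "d7"]
relevant = {"d1", "d9"}
recall = recall_at_k(ranked, relevant, k=10)
ndcg = ndcg_at_k(ranked, relevant, k=10)
print(f"Recall@10={recall:.4f}  nDCG@10={ndcg:.4f}")
```

Benchmark-level numbers are then typically the mean of these per-query values over all queries in a dataset.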