
Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents

About

Text embedding models have emerged as powerful tools for transforming sentences into fixed-sized feature vectors that encapsulate semantic information. While these models are essential for tasks like information retrieval, semantic clustering, and text re-ranking, most existing open-source models, especially those built on architectures like BERT, struggle to represent lengthy documents and often resort to truncation. One common approach to mitigate this challenge involves splitting documents into smaller paragraphs for embedding. However, this strategy results in a much larger set of vectors, consequently leading to increased memory consumption and computationally intensive vector searches with elevated latency. To address these challenges, we introduce Jina Embeddings 2, an open-source text embedding model capable of accommodating up to 8192 tokens. This model is designed to transcend the conventional 512-token limit and adeptly process long documents. Jina Embeddings 2 not only achieves state-of-the-art performance on a range of embedding-related tasks in the MTEB benchmark but also matches the performance of OpenAI's proprietary ada-002 model. Additionally, our experiments indicate that an extended context can enhance performance in tasks such as NarrativeQA.
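The trade-off described in the abstract, between splitting documents into chunks and embedding them with a longer context window, can be illustrated with simple arithmetic. The sketch below is not taken from the paper. The function names, the sample token counts, the 768-dimensional vector size, and the 4-byte float storage are all assumptions chosen for illustration.

```python
import math


def vectors_needed(doc_token_counts, max_tokens):
    """Count the embedding vectors produced when every document is split
    into non-overlapping chunks of at most max_tokens tokens."""
    return sum(math.ceil(n / max_tokens) for n in doc_token_counts if n > 0)


def index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for the vectors alone (no index overhead)."""
    return num_vectors * dim * bytes_per_value


# Hypothetical corpus: token counts of four documents.
docs = [300, 2000, 8192, 20000]

# Assumed embedding dimension, used here only for illustration.
DIM = 768

for limit in (512, 8192):
    n = vectors_needed(docs, limit)
    print(f"limit={limit}: {n} vectors, {index_bytes(n, DIM)} bytes")
```

With these assumed inputs, the 512-token limit yields roughly ten times as many vectors to store and search as the 8192-token limit. The exact ratio depends on the length distribution of the corpus.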

Michael Günther, Jackmin Ong, Isabelle Mohr, Alaeddine Abdessalem, Tanguy Abel, Mohammad Kalim Akram, Susana Guzman, Georgios Mastrapas, Saba Sturua, Bo Wang, Maximilian Werk, Nan Wang, Han Xiao • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Information Retrieval | BEIR v1.0.0 (test) | ArguAna | 46.7 | 55 |
| Attribution | Verifiability-Granular (test) | Attribution Accuracy | 67.1 | 28 |
| Multilingual long-doc retrieval | MLDR (test) | Average Retrieval Score | 37 | 14 |
| Community Detection | Arxiv 2023 | Topsis Score | 0.3402 | 13 |
| Community Detection | Instagram | Topsis Score | 0.2719 | 13 |
| Community Detection | Citeseer | Topsis Score | 0.3836 | 13 |
| Community Detection | Cora | Topsis Score | 0.3176 | 13 |
| Community Detection | wikiCS | Topsis Score | 0.3772 | 13 |
| Community Detection | Pubmed | Topsis Score | 0.2714 | 13 |
| Community Detection | Cora | DBI | 3.7387 | 13 |
Showing 10 of 12 rows
