
Nomic Embed: Training a Reproducible Long Context Text Embedder

About

This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast with other open-source models, we release the full curated training data and code that allows for full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.
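Retrieval benchmarks such as BEIR score an embedder by comparing query and document vectors, typically with cosine similarity. A minimal sketch of that comparison, using illustrative placeholder vectors rather than real model outputs (an actual model such as nomic-embed-text-v1 would emit much higher-dimensional embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Placeholder "embeddings" -- a real text embedder would produce
# vectors with hundreds of dimensions, one per input text.
query_vec = [0.1, 0.3, 0.5]
doc_vec = [0.2, 0.1, 0.4]

score = cosine_similarity(query_vec, doc_vec)
print(round(score, 3))
```

In a retrieval setting, the documents are ranked by this score against the query embedding; metrics like those reported in the table below (e.g. on TREC-COVID or ArguAna) are then computed over the resulting rankings.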

Zach Nussbaum, John X. Morris, Brandon Duderstadt, Andriy Mulyar • 2024

Related benchmarks

Task | Dataset | Result | Rank
Information Retrieval | BEIR | TREC-COVID: 0.63 | 59
Natural Language Understanding | GLUE (test val) | MRPC Accuracy: 88 | 59
Information Retrieval | BEIR v1.0.0 (test) | ArguAna: 52.45 | 55
Text Embedding | MTEB English v2 | Mean Score: 62.2 | 50
Open Hours | Houston (test) | F1 Score: 72.1 | 28
Busyness | Houston (test) | MAE: 0.162 | 28
Price Level | Los Angeles | Accuracy: 0.614 | 28
Price Level | Houston (test) | Accuracy: 57.8 | 28
Busyness | Los Angeles | MAE: 0.168 | 28
Permanent Closure | Los Angeles | F1 Score: 74.9 | 28

(showing 10 of 19 rows)
