Nomic Embed: Training a Reproducible Long Context Text Embedder
About
This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192-context-length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long-context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast to other open-source models, we also release the full curated training data and code, allowing full replication of nomic-embed-text-v1. Code and data to replicate the model are available at https://github.com/nomic-ai/contrastors.
Zach Nussbaum, John X. Morris, Brandon Duderstadt, Andriy Mulyar • 2024
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Information Retrieval | BEIR | TREC-COVID | 0.63 | 59 |
| Natural Language Understanding | GLUE (test val) | MRPC Accuracy | 88 | 59 |
| Information Retrieval | BEIR v1.0.0 (test) | ArguAna | 52.45 | 55 |
| Text Embedding | MTEB English v2 | Mean Score | 62.2 | 50 |