Nomic Embed: Training a Reproducible Long Context Text Embedder
About
This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192-context-length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long-context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast to other open-source models, we also release the full curated training data and code, allowing full replication of nomic-embed-text-v1. Code and data to replicate the model are available at https://github.com/nomic-ai/contrastors.
Zach Nussbaum, John X. Morris, Brandon Duderstadt, Andriy Mulyar • 2024
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Information Retrieval | BEIR | TREC-COVID | 0.63 | 59 |
| Natural Language Understanding | GLUE (test val) | MRPC Accuracy | 88 | 59 |
| Information Retrieval | BEIR v1.0.0 (test) | ArguAna | 52.45 | 55 |
| Text Embedding | MTEB English v2 | Mean Score | 62.2 | 50 |