Towards General Text Embeddings with Multi-stage Contrastive Learning

About

We present GTE, a general-purpose text embedding model trained with multi-stage contrastive learning. In line with recent advancements in unifying various NLP tasks into a single format, we train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources. By significantly increasing the number of training data during both unsupervised pre-training and supervised fine-tuning stages, we achieve substantial performance gains over existing embedding models. Notably, even with a relatively modest parameter count of 110M, GTE$_\text{base}$ outperforms the black-box embedding API provided by OpenAI and even surpasses 10x larger text embedding models on the massive text embedding benchmark. Furthermore, without additional fine-tuning on each programming language individually, our model outperforms previous best code retrievers of similar size by treating code as text. In summary, our model achieves impressive results by effectively harnessing multi-stage contrastive learning, offering a powerful and efficient text embedding model with broad applicability across various NLP and code-related tasks.

Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang• 2023

Related benchmarks

Task	Dataset	Result
Multi-hop QA	HotpotQA	Exact Match58.6	143
Information Retrieval	BEIR	SciFact0.753	120
Text Embedding	MTEB English v2	Mean Score67.2	107
Multi-hop QA	MuSiQue	EM30.6	95
Information Retrieval	BRIGHT	Mean nDCG@1023.5	94
Retrieval	Natural Questions (test)	Top-5 Recall74.3	62
Natural Language Understanding	GLUE (test val)	MRPC Accuracy92.1	59
Sentence Embedding Evaluation	MTEB (test)	Classification Score86.58	55
Single-hop QA	NQ (Natural Questions)	EM46.6	52
Tool Calling	API-Bank L-1	--	46

Showing 10 of 135 rows

...

Other info

Follow for update

@wizwand_team Discord