C-Pack: Packed Resources For General Chinese Embeddings

About

We introduce C-Pack, a package of resources that significantly advances the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperformed all prior Chinese text embeddings on C-MTEB by up to +10% at the time of release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embeddings, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.

Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, Jian-Yun Nie • 2023

Related benchmarks

Task                          Dataset             Result                       Rank
Multi-hop Question Answering  2WikiMultihopQA     EM 45.86                     278
Multi-hop Question Answering  MuSiQue             EM 18.36                     106
Multi-hop Question Answering  Bamboogle           Exact Match 42.4             97
Information Retrieval         BEIR (test)         --                           76
Semantic Textual Similarity   STS-B               Spearman's Rho (x100) 77.63  70
Multi-hop Question Answering  HotpotQA            Exact Match (EM) 44.36       56
Information Retrieval         BEIR v1.0.0 (test)  ArguAna 63.5                 55
Text Embedding                MTEB                MTEB Score 64.23             45
General Question Answering    TriviaQA            Exact Match 63.66            39
General Question Answering    NQ                  Exact Match (EM) 40.36       36
Showing 10 of 71 rows