C-Pack: Packed Resources For General Chinese Embeddings

About

We introduce C-Pack, a package of resources that significantly advances the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% at the time of release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.

Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, Jian-Yun Nie • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multi-hop Question Answering | 2WikiMultihopQA | EM 45.86 | 559 |
| Question Answering | 2Wiki | -- | 241 |
| Multi-hop Question Answering | MuSiQue | EM 18.36 | 209 |
| Semantic Textual Similarity | STS-B | Spearman's Rho (x100) 77.63 | 156 |
| Multi-hop Question Answering | HotpotQA | Exact Match (EM) 44.36 | 150 |
| Question Answering | NQ (test) | EM Accuracy 52.2 | 133 |
| Multi-hop Question Answering | Bamboogle | Exact Match 42.4 | 128 |
| Information Retrieval | BEIR (test) | -- | 126 |
| Information Retrieval | BRIGHT | Mean nDCG@10 11.4 | 94 |
| Question Answering | MuSiQue | F1 Score 19.7 | 79 |
Showing 10 of 212 rows
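
For readers unfamiliar with the metrics named in the table, the following is a minimal sketch of how Exact Match, token-level F1, and nDCG@10 are commonly computed for a single example. The page does not say how its scores were produced, so everything here is an assumption: the function names are hypothetical, the answer normalization follows the widely used SQuAD-style convention (lowercasing, stripping punctuation and English articles), and the nDCG ideal ranking is derived from the same relevance list that is passed in. The scores above appear to be reported on a 0 to 100 scale, whereas these functions return values in [0, 1].

```python
import math
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def ndcg_at_k(ranked_relevance: list[float], k: int = 10) -> float:
    """nDCG@k for one query, given relevance grades in ranked order.

    Assumption: the ideal ordering is obtained by sorting the same list,
    i.e. every relevant document is present somewhere in the list.
    """
    def dcg(rels: list[float]) -> float:
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0
```

A benchmark-level number such as "Mean nDCG@10" would then be the average of the per-query values, typically multiplied by 100.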