C-Pack: Packed Resources For General Chinese Embeddings
About
We introduce C-Pack, a package of resources that significantly advances the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% at the time of release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embeddings, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multi-hop Question Answering | 2WikiMultihopQA | EM 45.86 | 278 |
| Multi-hop Question Answering | MuSiQue | EM 18.36 | 106 |
| Multi-hop Question Answering | Bamboogle | Exact Match 42.4 | 97 |
| Information Retrieval | BEIR (test) | -- | 76 |
| Semantic Textual Similarity | STS-B | Spearman's Rho (x100) 77.63 | 70 |
| Multi-hop Question Answering | HotpotQA | Exact Match (EM) 44.36 | 56 |
| Information Retrieval | BEIR v1.0.0 (test) | ArguAna 63.5 | 55 |
| Text Embedding | MTEB | MTEB Score 64.23 | 45 |
| General Question Answering | TriviaQA | Exact Match 63.66 | 39 |
| General Question Answering | NQ | Exact Match (EM) 40.36 | 36 |