| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Instruction Following | AlpacaEval 2.0 | Win Rate | 95.87 | 507 |
| Instruction Following | AlpacaEval | Win Rate | 97.2 | 227 |
| LLM alignment evaluation | AlpacaEval 2 | LC Win Rate | 51.9 | 86 |
| Instruction Following | AlpacaEval 2.0 (test) | LC Win Rate (%) | 67.45 | 81 |
| Instruction Following and Helpfulness Evaluation | AlpacaEval 2.0 | Win Rate | 49.4 | 58 |
| LLM Alignment Evaluation | AlpacaEval 2.0 (test) | LC Win Rate | 30.35 | 51 |
| Chat | AlpacaEval 2.0 (test) | AlpacaEval (LC win %) | 57.46 | 46 |
| Open-ended Generation | AlpacaEval 2.0 | Win Rate | 648 | 43 |
| Open-ended | AlpacaEval | Win Rate vs Davinci-003 | 93.5 | 40 |
| Chat | AlpacaEval | Win Rate | 3,213 | 39 |
| Instruction Following | AlpacaEval (test) | Helpfulness Score | 3,213 | 32 |
| Instruction following | AlpacaEval High-Variance (Top 20%) 2.0 | Reward Score | 11.6 | 26 |
| Instruction following | AlpacaEval 2.0 (Overall) | Reward | 11.62 | 26 |
| General Performance | AlpacaEval | Winrate | 98 | 25 |
| Safety Guardrailing | AlpacaEval | False Positive Rate | 0 | 24 |
| Chat Evaluation | AlpacaEval LC 2 | Score | 74.11 | 23 |
| Open-ended Generation | AlpacaEval 1.0 | Win Rate | 7,904 | 23 |
| Instruction Following | AlpacaEval Yoruba | Win Rate (%) | 68.9 | 20 |
| Instruction Following | AlpacaEval Swahili | Win Rate | 83 | 20 |
| Instruction Following | AlpacaEval Indonesian | Win Rate | 64.2 | 20 |
| Instruction Following | AlpacaEval Korean | Win Rate | 77.8 | 20 |
| Instruction Following | AlpacaEval German | Win Rate | 65.2 | 20 |
| Instruction Following | AlpacaEval Chinese | Win Rate | 70.4 | 20 |
| LLM Evaluation | AlpacaEval | AlpacaE | 51.06 | 16 |
| Instruction Following Evaluation | AlpacaEval 2 | Win Rate | 48.14 | 16 |