| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Instruction Following | AlpacaEval 2.0 | Win Rate | 95.87 | 722 |
| Instruction Following | AlpacaEval | Win Rate | 98.4 | 420 |
| Instruction Following | AlpacaEval 2 | LC (%) | 75.4 | 137 |
| Instruction Following | AlpacaEval 2.0 (test) | LC Win Rate (%) | 67.45 | 95 |
| LLM alignment evaluation | AlpacaEval 2 | LC Win Rate | 51.9 | 89 |
| Instruction Following | AlpacaEval (test) | Helpfulness Score | 3,213 | 65 |
| Chat | AlpacaEval 2.0 (test) | AlpacaEval (LC win %) | 57.46 | 58 |
| Instruction Following and Helpfulness Evaluation | AlpacaEval 2.0 | Win Rate | 49.4 | 58 |
| LLM Alignment Evaluation | AlpacaEval 2.0 (test) | LC Win Rate | 30.35 | 51 |
| Instruction Following | AlpacaEval LC 2 | Win Rate | 80.9 | 49 |
| Preference Evaluation | AlpacaEval 2 | WR (%) | 559 | 48 |
| Open-ended Generation | AlpacaEval 2.0 | Win Rate | 648 | 43 |
| Open-ended | AlpacaEval | Win Rate vs Davinci-003 | 93.5 | 40 |
| Chat | AlpacaEval | Win Rate | 3,213 | 39 |
| Pairwise evaluation | AlpacaEval | Human Agreement | 72.4 | 37 |
| Dialogue | AlpacaEval 2 | AlpacaEval2 Score | 64.2 | 34 |
| Instruction Following | AlpacaEval Length-controlled | Score | 73.9 | 34 |
| Predictive LLM Routing | AlpacaEval | Score (vs OpenAI) | 63.17 | 26 |
| Instruction following | AlpacaEval High-Variance (Top 20%) 2.0 | Reward Score | 11.6 | 26 |
| Instruction following | AlpacaEval 2.0 (Overall) | Reward | 11.62 | 26 |
| LLM Alignment | AlpacaEval 2.0 | LC Win Rate | 61.52 | 25 |
| General Performance | AlpacaEval | Winrate | 98 | 25 |
| Safety Guardrailing | AlpacaEval | False Positive Rate | 0 | 24 |
| LLM Alignment | AlpacaEval | Win Rate | 25.24 | 24 |
| Chat Evaluation | AlpacaEval LC 2 | Score | 74.11 | 23 |