| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reasoning | LiveBench Reasoning | Accuracy 92 | 80 |
| General Reasoning | LiveBench | Accuracy 53.47 | 50 |
| Ensemble Committee Selection | LiveBench (test) | Mean θtest 99.07 | 34 |
| Code Generation | LiveBench (test) | Sig. Score 56.5 | 26 |
| Reasoning | LiveBench | Accuracy 22.3 | 25 |
| General LLM Benchmarking | LiveBench | Official Score 49.6 | 24 |
| Code Generation | LiveBench | Avg@8 42.9 | 22 |
| Code Generation | LiveBench | Signal 58.7 | 21 |
| Mathematical Reasoning | LiveBench Math | Initial Task Score 58.1 | 16 |
| Reasoning | LiveBench | Accuracy 33 | 16 |
| General Evaluation | LiveBench | Accuracy 46.83 | 15 |
| Coding | LiveBench | Accuracy 40.23 | 15 |
| Mathematical Reasoning | LiveBench | Accuracy 53.6 | 12 |
| Single-event Scene Revisit (Different Pose) | LiveBench | DINO Feature Similarity (FG) 0.691 | 8 |
| Single-event Scene Revisit (Same Pose) | LiveBench | PSNR (Background) 20.132 | 8 |
| Instruct Following | LiveBench | Average Instruction Following Score 55.39 | 6 |
| General Evaluation | LiveBench 1125 | Score 52.1 | 6 |
| General Tasks | LiveBench 2024-11-25 | Accuracy 75.9 | 5 |
| Mathematical Reasoning | LiveBench Math (test) | Score 51.95 | 5 |
| Examination | LiveBench 2024-11-25 | Score 70.79 | 5 |
| General Tasks | LiveBench 0831 | Accuracy 0.57 | 5 |
| Reasoning | LiveBench (test) | Accuracy 18.15 | 3 |