| Dataset Name | SOTA Method | Metric | Score | Trend | Updated |
|---|---|---|---|---|---|
| Aggregate K&R, IFEval-PT, HumanEval | Tucano2-qwen-3.7B-Instruct | Average Score | 53.64 | 14 | 3mo ago |
| Aggregate IFEval, IFBench, Arena-Hard-v2.0, Creative Writing v3, WritingBench | Hybrid Reward | Average Score | 71.9 | 11 | 5d ago |
| General Capability Suite (MMLU, GSM8K, GPQA) | MUSE-D | MMLU Accuracy | 73.6 | 5 | 22d ago |
| BIG-bench 57 Task | GAL 120B | Accuracy (Weighted) | 48.7 | 5 | 3mo ago |