| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| HumanEval | CompassMax-V3-Thinking | Pass@1 | 98.17 | 168 | 6d ago |
| HumanEval+ | | Pass@1 | 95.12 | 164 | 18h ago |
| MBPP | | Accuracy | 98.4 | 145 | 9d ago |
| MBPP+ | | Pass@1 | 97.88 | 117 | 5d ago |
| MBPP | SwiR | Pass@1 Accuracy | 95.33 | 78 | 26d ago |
| HumanEval | DMoA | Accuracy | 95.62 | 60 | 14d ago |
| Coding Tasks (test) | SALE | Pass@1 | 98.3 | 42 | 3mo ago |
| LiveCodeBench | RSA | Accuracy | 70 | 38 | 18h ago |
| MBPP | | Overall Average Score | 81 | 37 | 23h ago |
| HumanEval, MBPP | D3 | HumanEval Score | 50.2 | 35 | 1d ago |
| HumanEval | Ministral-3-R | HumanEval Mean Score | 0.9695 | 32 | 2mo ago |
| LiveCodeBench v6 | Mellum 2 (SFT) | Score (%) | 75.1 | 31 | 1d ago |
| MultiPL-E | | Score | 87.9 | 31 | 7d ago |
| HumanEval (test) | | Test Accuracy | 74.4 | 30 | 26d ago |
| LiveCodeBench | EqLen-GRPO | Acc (avg@32) | 74.2 | 29 | 22d ago |
| LiveCodeBench v5 | Qwen3-235B-A22B-R-TAP | Accuracy | 77.6 | 29 | 3mo ago |
| HumanEval | | HumanEval | 79.9 | 28 | 11d ago |
| Coding Suite EvalPlus & LiveCodeBench | | Eval+ Score | 86.7 | 26 | 2mo ago |
| WildChat 5,000 conversations | | KL Divergence (Forward) | 0.26 | 24 | 22d ago |
| LiveCodeBench | | Task Accuracy | 79 | 23 | 3mo ago |
| MBPP | Ministral-3-R | Score | 94.16 | 23 | 11d ago |
| LiveCode | REAP | LiveCode Score | 41.2 | 22 | 2mo ago |
| Eval+ | | Eval+ Score | 81.4 | 22 | 2mo ago |
| Coverage (test) | GPT4o | Precision | 94.57 | 21 | 3mo ago |
| HEval | Qwen-2.5-7B | Accuracy | 84.8 | 20 | 6d ago |