| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| SWE-Bench Verified | Mini-SWE | Pass@1 | 84 | 18 | 4d ago |
| SWE Verified | | Resolution Rate | 77.2 | 17 | 4d ago |
| SWE-bench Verified | SWE-smith-LM-32B | Resolution Rate | 0.402 | 9 | 4d ago |
| SWE-Bench Pro (public) | CCA | Resolve Rate (Pass@1) | 59 | 9 | 4d ago |
| SWE-rebench January 2026 (test) | | Resolved Rate | 52.9 | 8 | 4d ago |
| SWE-bench Lite | TOOLSELF | Accuracy | 16.1 | 8 | 4d ago |
| SWE-bench Verified | AdaptOrch | Accuracy | 52.6 | 7 | 4d ago |
| SWE-Bench (val) | | Acc | 28.8 | 7 | 4d ago |
| SWE-bench Verified | GEA | Worst-Case Success Rate | 71 | 6 | 4d ago |
| SWE-Bench AgentLess Repair | MiMo-V2-Flash Base | Resolved Percentage | 30.8 | 4 | 4d ago |
| SWE-Bench Python subset Pro | SageAgent | Resolution Rate | 59 | 3 | 4d ago |
| Software Engineering | FrugalGPT | Functional Correctness | 0.335 | 2 | 4d ago |
| Sphinx 64K context 44 samples | - | - | - | 0 | 4d ago |
| Sympy 64K context 75 samples | - | - | - | 0 | 4d ago |
| Django 64K context 231 samples | - | - | - | 0 | 4d ago |