| Dataset Name | SOTA Method | Metric | Value | Trend | Last Updated |
|---|---|---|---|---|---|
| BBH | UPA | Accuracy | 100 | 201 | 10d ago |
| LogiQA | Denser | LogiQA Accuracy | 78.9 | 181 | 18d ago |
| LogiQA (test) | | Accuracy | 86 | 151 | 8d ago |
| FOLIO | VERGE Full | Accuracy | 89.2 | 123 | 4d ago |
| Sudoku | EGSPO-SA | Accuracy | 94.3 | 119 | 2d ago |
| Formal Logic | | Accuracy | 77.8 | 106 | 3d ago |
| LogiQA | Qwen3-8B-thinking | Accuracy | 80.4 | 100 | 1mo ago |
| LogiQA | | Accuracy | 50.23 | 98 | 1mo ago |
| ReClor (test) | IDOL | Accuracy | 80.6 | 87 | 1mo ago |
| LogicVista | Qwen3-VL-30B A3B-Instruct | Accuracy | 58.2 | 84 | 15d ago |
| ProofW | Denser | Accuracy | 83.7 | 80 | 1mo ago |
| FOLIO (test) | HBLR | Accuracy | 95.6 | 58 | 1mo ago |
| StrategyQA | | Accuracy | 89 | 58 | 1mo ago |
| Stepgame k=10 | LLM-ASP | Accuracy | 88.1 | 56 | 24d ago |
| Stepgame k=4 | LLM-ASP | Accuracy | 93.8 | 56 | 24d ago |
| Stepgame k=3 | PoT-LLM | Accuracy | 89.5 | 56 | 24d ago |
| CounterBench (test) | FLEx | Accuracy | 88.9 | 55 | 1mo ago |
| LogiQA (val) | GPT-4-0125-preview | Accuracy | 58.37 | 50 | 1mo ago |
| LogicVista | ADHint | Avg Pass@8 | 64.9 | 48 | 3d ago |
| ZebraLogic | | Accuracy | 98.8 | 48 | 1mo ago |
| HLE | | Accuracy | 0.7226 | 46 | 4d ago |
| ReClor (dev) | FOCAL REASONER | Accuracy | 0.786 | 46 | 1mo ago |
| AR-LSAT | VERGE Full | Accuracy | 91.7 | 44 | 1mo ago |
| ProofWriter | PoT | Accuracy | 98.4 | 44 | 25d ago |
| CLUTRR | DIVERSE | Accuracy | 95.9 | 42 | 24d ago |