| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Cross-distribution performance | SAIR official benchmark (hard3) | Accuracy: 81.3 | 13 |
| Cross-distribution performance | SAIR (hard2) | Accuracy: 99 | 13 |
| Counterexample Generation | SAIR hard3 Official (n=20) | Accuracy: 65.3 | 3 |
| Counterexample Generation | SAIR hard3 Local (n=400) | Accuracy: 79.25 | 3 |
| Logical Reasoning Verification | SAIR official benchmark (hard2) | Official Accuracy: 48 | 2 |