Logical Reasoning Verification on SAIR official benchmark (hard2)
[Chart: Official Accuracy by date. Labeled point: AN45c, 48. X-axis label: Apr 20, 2026. Selectable metrics: Official Accuracy, Official F1 Score, Delta vs Baseline (pp).]
Updated 1mo ago
Evaluation Results
Method  Settings                  Date     Official Accuracy  Official F1 Score  Delta vs Baseline (pp)
AN45c   Model=GPT-OSS 120B, n=20  2026.04  48                 61.2               8.3
AN38    Model=GPT-OSS 120B, n=20  2026.04  41                 47.8               15.3