
SAIR

Benchmarks

Task Name                       Dataset Name                     SOTA Result           Trend
Cross-distribution performance  SAIR official benchmark (hard3)  Accuracy 81.3         13
Cross-distribution performance  SAIR (hard2)                     Accuracy 99           13
Counterexample Generation       SAIR hard3 Official (n=20)       Accuracy 65.3         3
Counterexample Generation       SAIR hard3 Local (n=400)         Accuracy 79.25        3
Logical Reasoning Verification  SAIR official benchmark (hard2)  Official Accuracy 48  2