AQUA

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Math Reasoning | AQuA | Accuracy 93.55 | 188 |
| Mathematical Reasoning | AQUA | Accuracy 85.05 | 167 |
| Algebraic Reasoning | AQUA | Accuracy 91.89 | 65 |
| Mathematical Reasoning | AQuA | AQuA Exact Match 79.92 | 60 |
| Arithmetic Reasoning | AQuA (test) | Accuracy 74.63 | 58 |
| Arithmetic Reasoning | AQUA | Accuracy 77.1 | 57 |
| Mathematical Reasoning | AQuA | Accuracy 87.01 | 45 |
| Multiple-choice Question Answering | AQuA | Accuracy 89.45 | 43 |
| Hallucination Detection | AQuA | AUROC 0.7822 | 31 |
| Marine species classification | AQUA20 (test) | Macro F1 88.9 | 28 |
| Symbolic Reasoning | AQUA | Accuracy 80.3 | 26 |
| Reasoning | AQuA | CACC (%) 72 | 25 |
| Hybrid Reasoning | AQUA (test) | Accuracy 78.5 | 24 |
| Mathematical Reasoning | AQUA (test) | Accuracy 72.44 | 18 |
| Mathematical Reasoning | AQuA | Accuracy (Without Verifier) 74 | 16 |
| Algebraic Reasoning | AQuA | Performance (%) 66.36 | 12 |
| CoT faithfulness detection | AQuA | Accuracy (CoT Faithfulness) 77 | 12 |
| Complex Reasoning | AQuA | Accuracy 28.35 | 12 |
| Mathematical Reasoning | AQuA | FRS 96.8 | 9 |
| Mathematical Reasoning | AQUA (val) | Tokens at Best Step (K) 336 | 7 |
| Algebraic Reasoning | AQUA (test) | Accuracy 30.94 | 6 |
| Mathematical Reasoning | AQUA | Answer Selection Rate (ASR) 94.4 | 4 |
| Mathematical Reasoning | AQuA | Mean Accuracy 93.42 | 3 |
| Algebraic Reasoning | AQUA | PPL 22.7 | 3 |
| CoT Soundness Evaluation | AQuA | CSR 90 | 3 |
Showing 25 of 27 rows