
AQUA

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Mathematical Reasoning | AQUA | Accuracy: 85.05 | 146 |
| Math Reasoning | AQuA | Accuracy: 91.8 | 78 |
| Algebraic Reasoning | AQUA | Accuracy: 91.89 | 61 |
| Mathematical Reasoning | AQuA | AQuA Exact Match: 79.92 | 60 |
| Arithmetic Reasoning | AQuA (test) | Accuracy: 74.63 | 58 |
| Hallucination Detection | AQuA | AUROC: 0.7822 | 31 |
| Multiple-choice Question Answering | AQuA | Accuracy: 87.4 | 31 |
| Arithmetic Reasoning | AQUA | Accuracy: 77.1 | 31 |
| Marine species classification | AQUA20 (test) | Macro F1: 88.9 | 28 |
| Symbolic Reasoning | AQUA | Accuracy: 80.3 | 26 |
| Reasoning | AQuA | CACC (%): 72 | 25 |
| Mathematical Reasoning | AQuA | Accuracy (Without Verifier): 74 | 16 |
| Mathematical Reasoning | AQuA | FRS: 96.8 | 9 |
| Mathematical Reasoning | AQUA (val) | Tokens at Best Step (K): 336 | 7 |
| Mathematical Reasoning | AQUA (test) | Accuracy: 72.44 | 6 |
| CoT Soundness Evaluation | AQuA | CSR: 90 | 3 |
| CoT Naturalness | AQuA | PPL: 21.34 | 3 |
| Arithmetic Reasoning | AQUA | Accuracy (format-specific prompt): 33.5 | 2 |
| Algebraic Reasoning | AQUA (test) | Accuracy: - | 0 |