Mathematical reasoning and calculation on TPS-CalcBench (test)
[Chart: KPI. Highlighted result: gpt-5.2, 90.2 (Apr 20, 2026). Metrics: KPI, Delta KPI, Hallucination Rate (%).]
Updated 1mo ago
Evaluation Results
| Model | Mode | Date | KPI | Delta KPI | Hallucination Rate (%) |
|---|---|---|---|---|---|
| gpt-5.2 | mode=RAG-EQ (retrieval... | 2026.04 | 90.2 | 2.4 | 38 |
| gpt-5.2 | mode=Base (zero-shot b... | 2026.04 | 87.85 | - | - |
| deepseek-v3.1 | mode=RAG-EQ (retrieval... | 2026.04 | 84.1 | 4.9 | 44 |
| aws.claude-opus-4.5 | mode=RAG-EQ (retrieval... | 2026.04 | 83.6 | 5.8 | 52 |
| aws.claude-sonnet-4.5 | mode=RAG-EQ (retrieval... | 2026.04 | 79.4 | 7.6 | 61 |
| deepseek-v3.1 | mode=Base (zero-shot b... | 2026.04 | 79.22 | - | - |
| aws.claude-opus-4.5 | mode=Base (zero-shot b... | 2026.04 | 77.8 | - | - |
| aws.claude-sonnet-4.5 | mode=Base (zero-shot b... | 2026.04 | 71.81 | - | - |
| gemini-3-flash-preview | mode=RAG-EQ (retrieval... | 2026.04 | 57.3 | 15.2 | 71 |
| gemini-3-flash-preview | mode=Base (zero-shot b... | 2026.04 | 42.15 | - | - |
| MiniMax-M2.5 | mode=RAG-EQ (retrieval... | 2026.04 | 28.7 | 15.6 | 68 |
| MiniMax-M2.5 | mode=Base (zero-shot b... | 2026.04 | 13.12 | - | - |
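The page does not define Delta KPI. The reported values are consistent with it being each model's RAG-EQ KPI minus its Base KPI, rounded half-up to one decimal place. The sketch below checks that reading against the listed results; the `delta_kpi` helper and the `KPI` dictionary are illustrative names, and the definition itself is an assumption inferred from the numbers.

```python
from decimal import Decimal, ROUND_HALF_UP

# (RAG-EQ KPI, Base KPI) pairs copied from the results listed above.
# Kept as strings so Decimal parses them exactly (no binary float error).
KPI = {
    "gpt-5.2": ("90.2", "87.85"),
    "deepseek-v3.1": ("84.1", "79.22"),
    "aws.claude-opus-4.5": ("83.6", "77.8"),
    "aws.claude-sonnet-4.5": ("79.4", "71.81"),
    "gemini-3-flash-preview": ("57.3", "42.15"),
    "MiniMax-M2.5": ("28.7", "13.12"),
}


def delta_kpi(rag_eq: str, base: str) -> Decimal:
    """Assumed definition: RAG-EQ KPI minus Base KPI, rounded half-up to 0.1."""
    diff = Decimal(rag_eq) - Decimal(base)
    return diff.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Compute the gain from retrieval for every model.
deltas = {model: delta_kpi(rag, base) for model, (rag, base) in KPI.items()}

for model, delta in deltas.items():
    print(f"{model}: {delta}")
```

Under this assumption the computed values match the Delta KPI column for all six models. Decimal arithmetic is used because plain floats round 57.3 - 42.15 down to 15.1 instead of the listed 15.2.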