Multi-turn conversation performance on Math
Top Avg Performance: 94.5 (Full), Feb 7, 2026
[Chart: Avg Performance and Reliability]
Updated 1mo ago
Evaluation Results
Method                     | Configuration (truncated)  | Date    | Avg Performance | Reliability
Full                       | Model=GPT-5.2, Setting...  | 2026.02 | 94.5            | 89.4
Full                       | Model=DeepSeek-v3.2-Th...  | 2026.02 | 94              | 81.6
Full                       | Model=GPT-4o-mini, Set...  | 2026.02 | 87.2            | 70.9
Experience-Driven Mediator | Model=DeepSeek-v3.2-Th...  | 2026.02 | 86.3            | 67.3
Experience-Driven Mediator | Model=GPT-5.2, Setting...  | 2026.02 | 80.6            | 62
Sharded                    | Model=DeepSeek-v3.2-Th...  | 2026.02 | 78.8            | 56.3
Experience-Driven Mediator | Model=GPT-4o-mini, Set...  | 2026.02 | 77.7            | 70.4
Sharded                    | Model=GPT-5.2, Setting...  | 2026.02 | 69.6            | 48.6
Sharded                    | Model=GPT-4o-mini, Set...  | 2026.02 | 64.9            | 45.6