Theorem Proving on Small-scale benchmark Overall
[Chart: VR by date. Highlighted point: Gemini-3-Pro, VR 33, Jan 24, 2026.]
Updated 4d ago
Evaluation Results
| Method | Settings | Date | VR |
|---|---|---|---|
| Gemini-3-Pro | Thinking level=low, Ex... | 2026.01 | 33 |
| Claude-Sonnet-4.5 (Agentic) | Expert Calls=2072, Tok... | 2026.01 | 29 |
| Gemini-3-Flash | Thinking level=low, Ex... | 2026.01 | 23 |
| DeepSeek-V3.2 (Agentic) | Expert Calls=1100, Tok... | 2026.01 | 18 |
| Qwen-Max (Agentic) | Expert Calls=796, Toke... | 2026.01 | 16 |
| Baseline | Expert Calls=736 | 2026.01 | 15 |
| GPT-5.2 (Agentic) | Expert Calls=1372, Tok... | 2026.01 | 14 |
| DeepSeek-V3.2-Thinking (Agentic) | Expert Calls=900, Toke... | 2026.01 | 13 |