Theorem Proving on DeepTheorem
[Chart: False Rate by method. Highlighted point: DeepSeek-V3.2-Thinking (Agentic), False Rate 54, Jan 24, 2026. Selectable metrics: False Rate, Precision, Recall Rate.]
Updated 4d ago
Evaluation Results
| Method | Date | False Rate | Precision | Recall Rate |
|---|---|---|---|---|
| DeepSeek-V3.2-Thinking (Agentic) | 2026.01 | 54 | 18 | 18 |
| GPT-5.2 (Agentic) | 2026.01 | 58 | 16 | 14 |
| DeepSeek-V3.2 (Agentic) | 2026.01 | 62 | 22 | 22 |
| Gemini-3-Flash (Thinking level=low) | 2026.01 | 72 | 32 | 26 |
| Gemini-3-Pro (Thinking level=low) | 2026.01 | 76 | 42 | 36 |
| Qwen-Max (Agentic) | 2026.01 | 76 | 22 | 22 |
| Baseline | 2026.01 | 76 | 20 | 16 |
| Claude-Sonnet-4.5 (Agentic) | 2026.01 | 80 | 36 | 32 |