Science Reasoning on GPQA Diamond (Pass@1)
[Chart: Pass@1 by submission date, Dec 2 to Dec 15, 2025. Top score: 91.9 (Gemini-3.0 Pro).]
Updated 3d ago
Evaluation Results

| Method | Setting | Date | Pass@1 |
|---|---|---|---|
| Gemini-3.0 Pro | temperature=1 | 2025.12 | 91.9 |
| GPT-5 High | temperature=1 | 2025.12 | 85.7 |
| Kimi-K2 | thinking mode=true | 2025.12 | 84.5 |
| Claude-4.5-Sonnet | temperature=1 | 2025.12 | 83.4 |
| DeepSeek-V3.2 | thinking mode=true | 2025.12 | 82.4 |
| MiniMax M2 |  | 2025.12 | 77.7 |
| TRAPO | Training Paradigm=Semi... | 2025.12 | 43.9 |
| Fully Supervised | Training Paradigm=Supe... | 2025.12 | 40.4 |
| Fully Supervised | Training Paradigm=Supe... | 2025.12 | 39.9 |
| TRAPO | Training Paradigm=Semi... | 2025.12 | 37.9 |
| Fully Supervised | Training Paradigm=Supe... | 2025.12 | 36.4 |
| TTRL | Training Paradigm=Unsu... | 2025.12 | 35.4 |
| TTRL | Training Paradigm=Semi... | 2025.12 | 35.4 |
| Sentence-level Entropy | Training Paradigm=Semi... | 2025.12 | 33.8 |
| Token-level Entropy | Training Paradigm=Unsu... | 2025.12 | 33.3 |
| Sentence-level Entropy | Training Paradigm=Unsu... | 2025.12 | 32.3 |
| Token-level Entropy | Training Paradigm=Semi... | 2025.12 | 32.3 |
| Self-certainty | Training Paradigm=Unsu... | 2025.12 | 30.8 |
| Self-certainty | Training Paradigm=Semi... | 2025.12 | 30.3 |
| Qwen-Instruct | Training Paradigm=Orig... | 2025.12 | 24.7 |
| Qwen-Base | Training Paradigm=Orig... | 2025.12 | 11.1 |
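The leaderboard does not say how many samples were drawn per question for the temperature=1 entries. A common way to report Pass@1 under sampling is the unbiased pass@k estimator introduced with the HumanEval benchmark, 1 - C(n-c, k)/C(n, k), which for k=1 reduces to the fraction c/n of correct samples per question, averaged over questions. The sketch below assumes that estimator; the function names and the per-question counts are hypothetical and are not data from the table above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one question: n samples drawn, c of them correct."""
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct answer.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average pass@1 over questions; results holds one (n, c) pair per question."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


# Hypothetical counts: 4 samples per question on 3 questions (not leaderboard data).
results = [(4, 4), (4, 2), (4, 0)]
print(round(100 * mean_pass_at_1(results), 1))  # prints 50.0
```

With k=1 the estimator is simply per-question accuracy, so a score such as 91.9 reads as the expected share of questions answered correctly by a single sample.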