Knowledge Reasoning on GPQA Diamond
Top result: TRICE-30B, 75.4 accuracy (avg@8)
[Chart: Accuracy (avg@8) over time, Dec 18, 2025 to May 7, 2026]
Updated 26d ago
Evaluation Results
Method                        Configuration               Date     Accuracy (avg@8)
TRICE-30B                     Use Tool=true               2026.05  75.4
Qwen3-30B-A3B-Thinking-2507   Use Tool=false              2026.05  71.2
TRICE-4B                      Use Tool=true               2026.05  68.8
Qwen3-4B-Thinking-2507        Use Tool=false              2026.05  64.4
Qwen3                         Params=32B                  2026.04  64.1
I-DLM                         Params=32B, Decoding=I...   2026.04  62.1
Qwen3                         Params=8B                   2026.04  58.9
I-DLM                         Params=8B, Decoding=IS...   2026.04  55.6
DeepSeek-R1-Distill-Qwen-7B   Architecture=Dense, #...    2025.12  47.1
Sigma-MoE-Tiny                Architecture=MoE, # Ac...   2025.12  46.4
LLaDA-2.1-mini                Params=16B                  2026.04  46
DeepSeek-R1-Distill-Llama-8B  Architecture=Dense, #...    2025.12  43.2
SDAR                          Params=8B                   2026.04  40.2
Qwen3-1.7B                    Architecture=Dense, #...    2025.12  40.1
Phi-3.5-MoE                   Architecture=MoE, # Ac...   2025.12  36.8
SDAR                          Params=30B                  2026.04  36.7