Scientific Reasoning on GPQA-D (test)
[Chart: Accuracy. Highlighted entry: Think, 68.7, Feb 7, 2026. Selectable metrics: Accuracy, Token Count, Latency. Updated 25d ago.]
Evaluation Results
Method     Base Model        Date      Accuracy   Token Count   Latency
Think      Qwen3-4B-Th...    2026.02   68.7       9,041         325.8
SpecExit   Qwen3-4B-Th...    2026.02   68.7       7,011         137
EAGLE3     Qwen3-4B-Th...    2026.02   67.7       8,975         212.2
NoThink*   Qwen3-4B-Th...    2026.02   67.2       8,833         276.8
DEER       Qwen3-4B-Th...    2026.02   67.2       9,053         505.2
SpecExit   DeepSeek-R1...    2026.02   46         6,849         307.5
EAGLE3     DeepSeek-R1...    2026.02   43.9       8,749         420.1
Vanilla    DeepSeek-R1...    2026.02   43.6       8,857         574
DEER       DeepSeek-R1...    2026.02   40.9       8,492         521.5
NoThink    DeepSeek-R1...    2026.02   26.8       1,200         166.6
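
The results above are easier to compare when each method is expressed relative to a reference row for the same base model. The following is a minimal sketch of that arithmetic, not part of the leaderboard itself. It assumes that "Think" (for the Qwen3-4B-Th... rows) and "Vanilla" (for the DeepSeek-R1... rows) are the unmodified baselines, which the page does not state. The function name `relative_to_baseline` and the field names are hypothetical, and the latency unit is whatever the page uses, since only ratios are computed.

```python
# Rows copied from the results table:
# (method, base model as displayed, accuracy, token count, latency).
ROWS = [
    ("Think",    "Qwen3-4B-Th...", 68.7, 9041, 325.8),
    ("SpecExit", "Qwen3-4B-Th...", 68.7, 7011, 137.0),
    ("EAGLE3",   "Qwen3-4B-Th...", 67.7, 8975, 212.2),
    ("NoThink*", "Qwen3-4B-Th...", 67.2, 8833, 276.8),
    ("DEER",     "Qwen3-4B-Th...", 67.2, 9053, 505.2),
    ("SpecExit", "DeepSeek-R1...", 46.0, 6849, 307.5),
    ("EAGLE3",   "DeepSeek-R1...", 43.9, 8749, 420.1),
    ("Vanilla",  "DeepSeek-R1...", 43.6, 8857, 574.0),
    ("DEER",     "DeepSeek-R1...", 40.9, 8492, 521.5),
    ("NoThink",  "DeepSeek-R1...", 26.8, 1200, 166.6),
]

# ASSUMPTION: these rows are the reference baselines for each base model.
BASELINES = {"Qwen3-4B-Th...": "Think", "DeepSeek-R1...": "Vanilla"}


def relative_to_baseline(rows, baselines):
    """Return accuracy delta, token ratio, and latency speedup per row."""
    # Look up the baseline row's metrics for each base model.
    base = {
        model: (acc, tok, lat)
        for name, model, acc, tok, lat in rows
        if baselines.get(model) == name
    }
    out = []
    for name, model, acc, tok, lat in rows:
        b_acc, b_tok, b_lat = base[model]
        out.append({
            "method": name,
            "model": model,
            "accuracy_delta": round(acc - b_acc, 1),  # points vs. baseline
            "token_ratio": round(tok / b_tok, 3),     # < 1 means fewer tokens
            "speedup": round(b_lat / lat, 2),         # > 1 means lower latency
        })
    return out


if __name__ == "__main__":
    for r in relative_to_baseline(ROWS, BASELINES):
        print(r)
```

Ratios are used rather than absolute differences so that the two base models, whose baseline latencies differ, can be read on the same scale.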