Science Reasoning on GPQA Diamond
[Chart: AUCOAA by date. Best result: Format-Adaptive-Answer, 71.6 (Jan 6, 2026). Updated 3d ago.]
Evaluation Results
Method                   Backbone   Date      AUCOAA
Format-Adaptive-Answer   Qwen3-8B   2026.01   71.6
Base model               Qwen3-8B   2026.01   69.4
Hard-Length 16k          Qwen3-8B   2026.01   68.9
Adaptive-Answer          Qwen3-8B   2026.01   68.8
SFT                      Qwen3-8B   2026.01   68.7
TWYN                     Qwen3-8B   2026.01   67.1
Hard-Length 8k           Qwen3-8B   2026.01   63.5
Soft-Length              Qwen3-8B   2026.01   62.9
Hard-Length 8k → 4k      Qwen3-8B   2026.01   61
Normalized-Length        Qwen3-8B   2026.01   58.3
No-Thinking              Qwen3-8B   2026.01   40.8