Knowledge & Instruction Following on GPQA Diamond (Score)
[Chart: Score (y-axis) over time (x-axis, Jun 1, 2026). Top entry: Qwen3.5-4B (released), Score 80.3.]
Updated 22h ago
Evaluation Results
Method                   Configuration              Date      Score
-----------------------  -------------------------  --------  -----
Qwen3.5-4B (released)    Model Size=4B, Trainin...  2026.06   80.3
Qwen3.5-9B (released)    Model Size=9B, Trainin...  2026.06   80.3
Qwen3.5-9B               Params=9B, Sampling mo...  2026.06   80.3
Mercury-2                Params=Commercial, Sam...  2026.06   73
Qwen3.5-9B + AR-SFT      Model Size=9B, Trainin...  2026.06   71.21
FLARE-9B                 Model Size=9B, Trainin...  2026.06   71.21
FLARE-9B                 Params=9B, Sampling mo...  2026.06   71.21
LLaDA-2.1-flash          Params=100B-A5B, Sampl...  2026.06   66.67
FLARE-9B                 Params=9B, Sampling mo...  2026.06   64.65
FLARE-4B                 Model Size=4B, Trainin...  2026.06   63.64
Qwen3.5-2B (released)    Model Size=2B, Trainin...  2026.06   62.12
LLaDA-2.0-flash          Params=100B-A5B, Sampl...  2026.06   61.98
Qwen3.5-4B + AR-SFT      Model Size=4B, Trainin...  2026.06   61.11
LLaDA-2.1-mini           Params=16B-A1B, Sampli...  2026.06   48.36
Qwen3.5-2B + AR-SFT      Model Size=2B, Trainin...  2026.06   47.98
LLaDA-2.0-mini           Params=16B-A1B, Sampli...  2026.06   47.98
FLARE-2B                 Model Size=2B, Trainin...  2026.06   37.37
SDAR 30B-A3B             Params=30B-A3B, Sampli...  2026.06   36.7