STEM Reasoning on GPQA Diamond
[Chart: Accuracy by date. Top entry: SwiR, 70.2 accuracy, Oct 6, 2025. Updated 1mo ago.]
Evaluation Results
| Method | Backbone | Date | Accuracy |
| --- | --- | --- | --- |
| SwiR | Qwen3-32B | 2025.10 | 70.2 |
| CoT (Greedy) | Qwen3-32B | 2025.10 | 69.7 |
| Soft Thinking | Qwen3-32B | 2025.10 | 67.17 |
| CoT | Qwen3-32B | 2025.10 | 66.16 |
| SwiR | Qwen3-8B | 2025.10 | 61.11 |
| CoT | Qwen3-8B | 2025.10 | 59.6 |
| Soft Thinking | Qwen3-8B | 2025.10 | 59.6 |
| CoT (Greedy) | Qwen3-8B | 2025.10 | 56.57 |
| SwiR | DeepSeek-R1-D... | 2025.10 | 47.98 |
| CoT | DeepSeek-R1-D... | 2025.10 | 46.46 |
| SwiR | Qwen3-1.7B | 2025.10 | 41.41 |
| CoT | Qwen3-1.7B | 2025.10 | 39.39 |
| Soft Thinking | Qwen3-1.7B | 2025.10 | 34.34 |
| Soft Thinking | DeepSeek-R1-D... | 2025.10 | 33.33 |
| CoT (Greedy) | Qwen3-1.7B | 2025.10 | 31.82 |
| CoT (Greedy) | DeepSeek-R1-D... | 2025.10 | 31.81 |
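
The results above mix four backbones, so the ranking is easier to read when grouped by backbone. The following is a minimal Python sketch of that grouping. It is not part of the leaderboard: the names `RESULTS`, `best_per_backbone`, and `gain_over` are hypothetical helpers introduced for illustration, and the accuracy values are copied by hand from the listed results. The truncated backbone name `DeepSeek-R1-D...` is kept exactly as displayed.

```python
from collections import defaultdict

# (method, backbone, accuracy) rows copied from the results listed above.
# "DeepSeek-R1-D..." is truncated on the page and is kept as-is here.
RESULTS = [
    ("SwiR", "Qwen3-32B", 70.2),
    ("CoT (Greedy)", "Qwen3-32B", 69.7),
    ("Soft Thinking", "Qwen3-32B", 67.17),
    ("CoT", "Qwen3-32B", 66.16),
    ("SwiR", "Qwen3-8B", 61.11),
    ("CoT", "Qwen3-8B", 59.6),
    ("Soft Thinking", "Qwen3-8B", 59.6),
    ("CoT (Greedy)", "Qwen3-8B", 56.57),
    ("SwiR", "DeepSeek-R1-D...", 47.98),
    ("CoT", "DeepSeek-R1-D...", 46.46),
    ("SwiR", "Qwen3-1.7B", 41.41),
    ("CoT", "Qwen3-1.7B", 39.39),
    ("Soft Thinking", "Qwen3-1.7B", 34.34),
    ("Soft Thinking", "DeepSeek-R1-D...", 33.33),
    ("CoT (Greedy)", "Qwen3-1.7B", 31.82),
    ("CoT (Greedy)", "DeepSeek-R1-D...", 31.81),
]


def best_per_backbone(rows):
    """Return {backbone: (method, accuracy)} for the top-scoring method."""
    best = {}
    for method, backbone, acc in rows:
        if backbone not in best or acc > best[backbone][1]:
            best[backbone] = (method, acc)
    return best


def gain_over(rows, method, baseline):
    """Return {backbone: accuracy(method) - accuracy(baseline)} in points."""
    by_backbone = defaultdict(dict)
    for m, backbone, acc in rows:
        by_backbone[backbone][m] = acc
    return {
        backbone: round(scores[method] - scores[baseline], 2)
        for backbone, scores in by_backbone.items()
        if method in scores and baseline in scores
    }


if __name__ == "__main__":
    for backbone, (method, acc) in best_per_backbone(RESULTS).items():
        print(f"{backbone}: best = {method} ({acc})")
    for backbone, delta in gain_over(RESULTS, "SwiR", "CoT").items():
        print(f"{backbone}: SwiR vs CoT = {delta:+.2f} points")
```

On these numbers, SwiR is the top entry for every listed backbone, and its margin over the CoT entry ranges from about 1.5 to about 4 accuracy points.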