Scientific Reasoning on GPQA (avg.@8)
[Chart: Avg.@8 over time. Top entry: PSFT warm-up, 33.27 (Aug 25, 2025).]
Evaluation Results
Method          Backbone              Date      Avg.@8
PSFT warm-up    Qwen2.5-7B-In...      2025.08   33.27
PSFT            Qwen2.5-7B-In...      2025.08   33.21
SFT-KL          Qwen2.5-7B-In...      2025.08   32.95
SFT             Qwen2.5-7B-In...      2025.08   32.89
Base            Qwen2.5-7B-In...      2025.08   31.38
PSFT            Llama3.1-8B-I...      2025.08   26.89
Base            Llama3.1-8B-I...      2025.08   24.62
PSFT warm-up    Llama3.1-8B-I...      2025.08   23.99
SFT             Llama3.1-8B-I...      2025.08   19.38
SFT-KL          Llama3.1-8B-I...      2025.08   18.18