General Language Capabilities on MMLU, GSM8K, GPQA, HumanEval, TruthfulQA, IFEval Aggregate
[Chart: Average Score over time. Top entry: GRPO, 71.2 (May 26, 2025).]
Evaluation Results
| Method | Backbone | Date | Average Score |
|--------|---------------|---------|------|
| GRPO   | LLaMA-3.1-8B  | 2025.05 | 71.2 |
| TI-DPO | LLaMA-3.1-8B  | 2025.05 | 71.1 |
| TPO    | LLaMA-3.1-8B  | 2025.05 | 70   |
| CPO    | LLaMA-3.1-8B  | 2025.05 | 68.9 |
| KTO    | LLaMA-3.1-8B  | 2025.05 | 68   |
| DPO    | LLaMA-3.1-8B  | 2025.05 | 66.8 |
| TDPO   | LLaMA-3.1-8B  | 2025.05 | 65.8 |
| SFT    | LLaMA-3.1-8B  | 2025.05 | 65.2 |
| IPO    | LLaMA-3.1-8B  | 2025.05 | 62.5 |
| SIMPO  | LLaMA-3.1-8B  | 2025.05 | 62.4 |
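
To make the spread between methods easier to read, the following minimal sketch computes each method's Average Score relative to SFT. The scores are copied from the results listed above. Using SFT as the reference point is an assumption made here for illustration, and the helper name `gains_over` is hypothetical; neither comes from the leaderboard.

```python
# Average scores as listed in the Evaluation Results (LLaMA-3.1-8B backbone, 2025.05).
scores = {
    "GRPO": 71.2, "TI-DPO": 71.1, "TPO": 70.0, "CPO": 68.9, "KTO": 68.0,
    "DPO": 66.8, "TDPO": 65.8, "SFT": 65.2, "IPO": 62.5, "SIMPO": 62.4,
}


def gains_over(baseline: str, table: dict[str, float]) -> dict[str, float]:
    """Return each method's score minus the baseline's, rounded to one decimal.

    Hypothetical helper: treating one row as the baseline is an assumption
    of this sketch, not something the leaderboard defines.
    """
    base = table[baseline]
    return {m: round(s - base, 1) for m, s in table.items() if m != baseline}


gains = gains_over("SFT", scores)

# Print methods from largest gain to largest loss relative to SFT.
for method, delta in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{method:7s} {delta:+.1f}")
```

Under this reading, methods with a positive difference score above the SFT row and methods with a negative difference score below it. The differences are in points of Average Score and say nothing about per-benchmark behavior, which the page does not break out.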