Utility Evaluation on MMLU (pass@1 Accuracy)
Chart: Accuracy (pass@1) over time (Aug 6, 2025); top result: SafeDecoding, 78.
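MMLU is a multiple-choice benchmark, so a response counts as correct when the chosen option matches the answer key. "pass@1" is most commonly read as the fraction of questions answered correctly on a single attempt. When n samples are drawn per question and c are correct, the widely used unbiased pass@k estimator 1 - C(n-c, k)/C(n, k) reduces to c/n for k = 1. The sketch below assumes that convention. The page does not state how many samples per question were drawn or how answers were extracted from the reasoning traces, and the helper names and toy data are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one question: n samples drawn, c of them correct."""
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(per_question: list[tuple[int, int]]) -> float:
    """Average pass@1 over questions given (n, c) pairs, returned as a percentage."""
    scores = [pass_at_k(n, c, 1) for n, c in per_question]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical toy data: 4 questions, 1 sample each, 3 answered correctly.
print(mean_pass_at_1([(1, 1), (1, 1), (1, 0), (1, 1)]))  # 75.0
```

With a single sample per question, this is plain accuracy. With several samples, it averages the per-question success rate c/n.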
Evaluation Results
| Method | Backbone | Date | Accuracy (pass@1) |
|---|---|---|---|
| SafeDecoding | R1-Llama-8B | 2025.08 | 78 |
| ReasoningGuard | R1-Llama-8B | 2025.08 | 74 |
| No Defense | R1-Llama-8B | 2025.08 | 73 |
| SafeKey | R1-Llama-8B | 2025.08 | 73 |
| Self-Reminder | R1-Llama-8B | 2025.08 | 73 |
| ThinkingI | R1-Llama-8B | 2025.08 | 73 |
| SAFEPATH-FT | R1-Llama-8B | 2025.08 | 72 |
| No Defense | R1-Qwen-7B | 2025.08 | 69 |
| SafeDecoding | R1-Qwen-7B | 2025.08 | 69 |
| RealSafe-R1 | R1-Qwen-7B | 2025.08 | 69 |
| SAFEPATH-ZS | R1-Llama-8B | 2025.08 | 69 |
| SafeKey | R1-Qwen-7B | 2025.08 | 67 |
| Self-Reminder | R1-Qwen-7B | 2025.08 | 67 |
| ThinkingI | R1-Qwen-7B | 2025.08 | 66 |
| RealSafe-R1 | R1-Llama-8B | 2025.08 | 66 |
| ReasoningGuard | R1-Qwen-7B | 2025.08 | 65 |
| SAFEPATH-ZS | R1-Qwen-7B | 2025.08 | 63 |
| SAFEPATH-FT | R1-Qwen-7B | 2025.08 | 59 |
| SmoothLLM | R1-Qwen-7B | 2025.08 | 43 |
| SmoothLLM | R1-Llama-8B | 2025.08 | 42 |
| Paraphrase | R1-Llama-8B | 2025.08 | 38 |
| Paraphrase | R1-Qwen-7B | 2025.08 | 30 |
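To read these results as a utility cost, compare each defense with the No Defense row on the same backbone. The sketch below copies the scores reported in the evaluation results and prints each defense's difference from that baseline in accuracy points. Negative values mean the defense lowered MMLU accuracy. It only re-arranges the reported numbers and is not an additional evaluation. The dictionary layout and the `delta_vs_baseline` helper are illustrative choices, not part of the benchmark.

```python
# Scores copied from the Evaluation Results: (method, backbone) -> accuracy (pass@1).
RESULTS = {
    ("SafeDecoding", "R1-Llama-8B"): 78,
    ("ReasoningGuard", "R1-Llama-8B"): 74,
    ("No Defense", "R1-Llama-8B"): 73,
    ("SafeKey", "R1-Llama-8B"): 73,
    ("Self-Reminder", "R1-Llama-8B"): 73,
    ("ThinkingI", "R1-Llama-8B"): 73,
    ("SAFEPATH-FT", "R1-Llama-8B"): 72,
    ("SAFEPATH-ZS", "R1-Llama-8B"): 69,
    ("RealSafe-R1", "R1-Llama-8B"): 66,
    ("SmoothLLM", "R1-Llama-8B"): 42,
    ("Paraphrase", "R1-Llama-8B"): 38,
    ("No Defense", "R1-Qwen-7B"): 69,
    ("SafeDecoding", "R1-Qwen-7B"): 69,
    ("RealSafe-R1", "R1-Qwen-7B"): 69,
    ("SafeKey", "R1-Qwen-7B"): 67,
    ("Self-Reminder", "R1-Qwen-7B"): 67,
    ("ThinkingI", "R1-Qwen-7B"): 66,
    ("ReasoningGuard", "R1-Qwen-7B"): 65,
    ("SAFEPATH-ZS", "R1-Qwen-7B"): 63,
    ("SAFEPATH-FT", "R1-Qwen-7B"): 59,
    ("SmoothLLM", "R1-Qwen-7B"): 43,
    ("Paraphrase", "R1-Qwen-7B"): 30,
}


def delta_vs_baseline(results, baseline="No Defense"):
    """Return {(method, backbone): accuracy minus the baseline's accuracy on that backbone}."""
    base = {bb: acc for (m, bb), acc in results.items() if m == baseline}
    return {
        (m, bb): acc - base[bb]
        for (m, bb), acc in results.items()
        if m != baseline
    }


deltas = delta_vs_baseline(RESULTS)
# Print from largest gain to largest loss, e.g. "R1-Llama-8B  SafeDecoding    +5".
for (method, backbone), d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{backbone:12} {method:15} {d:+d}")
```

Keying the baseline by backbone keeps the two model families separate. This matters because their No Defense scores differ (73 vs 69).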