Safety and Utility Evaluation on XSTest (S_safe, S_help, S_nat)
[Chart: Safety Score; highest entry Qwen3-max, 9.89 (May 8, 2026). Metrics shown: Safety Score, Helpfulness Score, Naturalness Score.]
Updated 23d ago
Evaluation Results
Method                             Date     Safety Score  Helpfulness Score  Naturalness Score
Qwen3-max                          2026.05  9.89          7.81               8.35
GPT-4o                             2026.05  9.84          3.58               6.64
Kimi-k2                            2026.05  9.68          7.35               8.13
LANCE (Refinement Model=1.5B,...)  2026.05  9.55          7.88               8.53
Gemini2.5-Pro                      2026.05  9.33          2.84               4.12