Safety Evaluation on WildTeaming 500-example (test)
[Chart: HarmR on WildTeaming 500-example (test). Highlighted result: Ft. Agent, 88.6 HarmR, Oct 19, 2025. Metrics shown: HarmR, Help@S.]
Updated 26d ago
Evaluation Results
| Method     | Backbone         | Date    | HarmR | Help@S |
|------------|------------------|---------|-------|--------|
| Ft. Agent  | Qwen-2.5-3B-I... | 2025.10 | 88.6  | 2.87   |
| Naive RAG  | Qwen-2.5-3B-I... | 2025.10 | 88.5  | 2.84   |
| Base Agent | Qwen-2.5-3B-I... | 2025.10 | 87.5  | 2.9    |
| Naive RAG  | Qwen-2.5-7B-I... | 2025.10 | 87.3  | 2.97   |
| Ft. Agent  | Qwen-2.5-7B-I... | 2025.10 | 87.3  | 2.88   |
| Base LLM   | Qwen-2.5-3B-I... | 2025.10 | 87    | 2.84   |
| Base Agent | Qwen-2.5-7B-I... | 2025.10 | 83.9  | 2.79   |
| Base LLM   | Qwen-2.5-7B-I... | 2025.10 | 81.7  | 2.73   |
| Naive RAG  | Qwen-2.5-14B-... | 2025.10 | 77.4  | 2.45   |
| Base Agent | Qwen-2.5-14B-... | 2025.10 | 69.1  | 2.43   |
| Base LLM   | Qwen-2.5-14B-... | 2025.10 | 68.5  | 2.4    |
| Base LLM   | Mistral-NeMo-... | 2025.10 | 56.3  | 3.29   |
| Naive RAG  | Mistral-NeMo-... | 2025.10 | 54.9  | 3.17   |
| Base Agent | Mistral-NeMo-... | 2025.10 | 54.9  | 3.17   |
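The results above mix four backbones, so a per-backbone view can be easier to read than the overall ranking. The following is a minimal sketch, not part of the leaderboard itself: the `ROWS` list and the helper `top_by_harmr_per_backbone` are hypothetical names introduced here, the backbone strings are kept exactly as truncated on the page, and the sketch assumes only that the page orders entries by descending HarmR. Ties keep the entry listed first.

```python
# Rows copied from the results listed above:
# (method, backbone as truncated on the page, HarmR, Help@S).
ROWS = [
    ("Ft. Agent", "Qwen-2.5-3B-I...", 88.6, 2.87),
    ("Naive RAG", "Qwen-2.5-3B-I...", 88.5, 2.84),
    ("Base Agent", "Qwen-2.5-3B-I...", 87.5, 2.9),
    ("Naive RAG", "Qwen-2.5-7B-I...", 87.3, 2.97),
    ("Ft. Agent", "Qwen-2.5-7B-I...", 87.3, 2.88),
    ("Base LLM", "Qwen-2.5-3B-I...", 87.0, 2.84),
    ("Base Agent", "Qwen-2.5-7B-I...", 83.9, 2.79),
    ("Base LLM", "Qwen-2.5-7B-I...", 81.7, 2.73),
    ("Naive RAG", "Qwen-2.5-14B-...", 77.4, 2.45),
    ("Base Agent", "Qwen-2.5-14B-...", 69.1, 2.43),
    ("Base LLM", "Qwen-2.5-14B-...", 68.5, 2.4),
    ("Base LLM", "Mistral-NeMo-...", 56.3, 3.29),
    ("Naive RAG", "Mistral-NeMo-...", 54.9, 3.17),
    ("Base Agent", "Mistral-NeMo-...", 54.9, 3.17),
]


def top_by_harmr_per_backbone(rows):
    """Return {backbone: (method, HarmR, Help@S)} for the highest-HarmR row.

    Hypothetical helper for illustration. On a tie, the row that appears
    first in `rows` is kept (strict '>' comparison).
    """
    best = {}
    for method, backbone, harmr, help_s in rows:
        if backbone not in best or harmr > best[backbone][1]:
            best[backbone] = (method, harmr, help_s)
    return best


if __name__ == "__main__":
    for backbone, (method, harmr, help_s) in top_by_harmr_per_backbone(ROWS).items():
        print(f"{backbone}: {method} (HarmR {harmr}, Help@S {help_s})")
```

Grouping this way makes it visible that the highest-HarmR method differs by backbone, which the flat ranking alone does not show.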