Multi-task Evaluation on Fairness and Utility Suite
[Chart: Average Score over time. Current best: Self-Debias Iter2 + Self-Correction at 82.1 (Apr 9, 2026). Updated 8 days ago.]
Evaluation Results

| Method | Date | Average Score |
| --- | --- | --- |
| Self-Debias Iter2 + Self-Correction | 2026.04 | 82.1 |
| Self-Debias Iter1 + Self-Correction | 2026.04 | 81.8 |
| Self-Debias Iter2 | 2026.04 | 81.7 |
| Self-Debias Offline + Self-Correction | 2026.04 | 81.3 |
| Self-Debias Iter1 | 2026.04 | 81.3 |
| Self-Debias SFT + Self-Correction | 2026.04 | 80.9 |
| Self-Debias Offline | 2026.04 | 80.8 |
| Self-Debias SFT | 2026.04 | 80.6 |
| Qwen1.5-8B | 2026.04 | 77.5 |
| Qwen2.5-7B-Instruct | 2026.04 | 77.4 |
| Qwen2.5-7B-Instruct + Self-Correction | 2026.04 | 70.9 |
| DeepSeek-R1-Distill-Qwen-7B | 2026.04 | 70.4 |
| Qwen1.5-8B + Self-Correction | 2026.04 | 64.0 |
| DeepSeek-R1-Distill-Qwen-7B + Self-Correction | 2026.04 | 63.7 |
| Llama-3.1-8B-Instruct | 2026.04 | 52.3 |
| Llama-3.1-8B-Instruct + Self-Correction | 2026.04 | 42.8 |