Multi-task Evaluation on Fairness and Utility Suite
[Chart: Average Score over time. Current best: Self-Debias Iter2 + Self-Correction at 82.1 (Apr 9, 2026). Updated 8 days ago.]
Evaluation Results

| Method | Date | Average Score |
| --- | --- | --- |
| Self-Debias Iter2 + Self-Correction | 2026.04 | 82.1 |
| Self-Debias Iter1 + Self-Correction | 2026.04 | 81.8 |
| Self-Debias Iter2 | 2026.04 | 81.7 |
| Self-Debias Offline + Self-Correction | 2026.04 | 81.3 |
| Self-Debias Iter1 | 2026.04 | 81.3 |
| Self-Debias SFT + Self-Correction | 2026.04 | 80.9 |
| Self-Debias Offline | 2026.04 | 80.8 |
| Self-Debias SFT | 2026.04 | 80.6 |
| Qwen1.5-8B | 2026.04 | 77.5 |
| Qwen2.5-7B-Instruct | 2026.04 | 77.4 |
| Qwen2.5-7B-Instruct + Self-Correction | 2026.04 | 70.9 |
| DeepSeek-R1-Distill-Qwen-7B | 2026.04 | 70.4 |
| Qwen1.5-8B + Self-Correction | 2026.04 | 64.0 |
| DeepSeek-R1-Distill-Qwen-7B + Self-Correction | 2026.04 | 63.7 |
| Llama-3.1-8B-Instruct | 2026.04 | 52.3 |
| Llama-3.1-8B-Instruct + Self-Correction | 2026.04 | 42.8 |