| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| BBQ | C2PO | Accuracy | 99.3 | 113 | 8d ago |
| Task 2 Persona F | | BAD Score | 0.3 | 25 | 9d ago |
| Task 2 Persona E | | BAD Score | 0.017 | 25 | 9d ago |
| Task 2 Persona D | | BAD Score | 0.007 | 25 | 9d ago |
| Task 2 Persona C | | BAD Score | -0.18 | 25 | 9d ago |
| Task 2 Persona B | | BAD Score | -0.007 | 25 | 9d ago |
| Task 2 Persona A | | BAD Score | -0.006 | 25 | 9d ago |
| KoBBQ | | Ambiguous Context Score | 98.3 | 17 | 26d ago |
| BBQ averaged across gender, nationality, and religion domains | Self-Debiasing | Accuracy (Ambiguous) | 87.73 | 16 | 1mo ago |
| SOCT | Llama 3.1 8B - LFT w. SH-N (baseline 3) | DR (Female Stereotype) | 0.054 | 15 | 1mo ago |
| Crows-pairs | Qwen 3 0.6B - LFT w. SH (baseline 2) | Pct Stereotype | 51.25 | 15 | 1mo ago |
| Honest | Llama 3.1 8B - LFT w. SH-Dgender (BaseCDA) | Honest Score | 11.7 | 15 | 1mo ago |
| Reddit Bias | Llama 3.1 8B - Pretrained model (baseline 1) | t-value | -4.7523 | 15 | 1mo ago |
| Male-biased prompts | Manually curated | Male Bias (Base) | 0.53 | 14 | 1mo ago |
| CrowS-Pairs | CDA | CS Score | 50.01 | 13 | 18d ago |
| HolisticBias | PaCE | GN Score | 66.2 | 10 | 29d ago |
| Crow-S | | Score | 57.25 | 9 | 1mo ago |
| RedditBias Religion | | Regard | 8.06 | 8 | 1mo ago |
| WinoGender | | EBS | 0.068 | 8 | 1mo ago |
| Consolidated Evaluation Dimensions | Council Mode | Bias σ² | 0.003 | 6 | 12d ago |
| CLEAR Bias | Finetuned | Age Performance | 82.9 | 5 | 18d ago |
| RTP | UGID | Bias | 0.3 | 4 | 29d ago |
| BBQ Gender | KLAAD | Ambiguity Score | 47.2 | 4 | 29d ago |
| BOLD | CDA | Bias Score | 1.037 | 4 | 29d ago |
| Human Evaluation Toxic Prompts | CAP-TTA | Biased Item Count | 2 | 3 | 1mo ago |