| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Bias Evaluation | BBQ | Accuracy | 99.3 | 113 |
| Question Answering | BBQ Gender | Accuracy | 82.4 | 36 |
| Question-Answering | BBQ | Accuracy | 92.12 | 36 |
| Reasoning | BBQ (test) | Accuracy (Reasoning BBQ) | 97.5 | 32 |
| Question Answering | BBQ (test) | Accuracy (amb) | 98.86 | 20 |
| Question Answering | BBQ Race | Accuracy | 81.9 | 18 |
| Question Answering | BBQ Nationality | Accuracy | 82.1 | 18 |
| Bias Evaluation | BBQ averaged across gender, nationality, and religion domains | Accuracy (Ambiguous) | 87.73 | 16 |
| Question Answering | BBQ (Bias Benchmark for QA) v1.0 (test) | BBQ SES Score | 93.1 | 16 |
| Bias Mitigation | BBQ SingleTurn | Age Bias | 16.3 | 12 |
| Question Answering | BBQ | Disambiguation TOP-1 | 83.93 | 12 |
| Question Answering | D_BBQ | Accuracy | 99.5 | 8 |
| Question Answering | BBQ Overall Llama-3 | Accuracy | 80.7 | 6 |
| Question Answering | BBQ disambiguated questions | Accuracy | 93 | 5 |
| Question Answering | BBQ (ambiguous) | Accuracy | 95 | 5 |
| Question Answering Bias Evaluation | BBQ | Accuracy (All) | 79 | 5 |
| Bias Evaluation | BBQ Gender | Ambiguity Score | 47.2 | 4 |
| Bias QA | BBQ Ambig | Accuracy | 85.04 | 4 |
| Bias QA | BBQ Disambig | Accuracy | 84.85 | 2 |
| Bias Evaluation | BBQ Disambiguated | Bias Score Before | 90.07 | 1 |