| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Hallucination Evaluation | AMBER | CHAIR 24.5 | 222 | |
| Generative Hallucination | AMBER Generative | Coverage (%) 70.4 | 81 | |
| Hallucination Assessment | AMBER | CHAIR_s 10.6 | 56 | |
| Object Hallucination Mitigation on Generative Tasks | AMBER | CHAIR 12.1 | 38 | |
| Hallucination Assessment | AMBER (test) | CHAIR 5.6 | 38 | |
| Object Hallucination Assessment | AMBER | CHAIR_I 16.2 | 35 | |
| Hallucination Detection | AMBER sampled 5k | A-ROC 85.99 | 30 | |
| Hallucination Evaluation (Generative) | AMBER-g | CHAIR Score 2.2 | 29 | |
| Multi-modal Hallucination Evaluation | AMBER | CHAIR 9.2 | 28 | |
| Hallucination Evaluation | AMBER Generative Task | Coverage 67.1 | 26 | |
| Action-relation hallucination evaluation | AMBER Relation | Accuracy 81.25 | 25 | |
| Discriminative Hallucination Evaluation | AMBER-d | F1 Score 89.5 | 23 | |
| Discriminative Object Hallucination | AMBER Discriminative Task | F1 Score 87.4 | 22 | |
| Generative Hallucination | AMBER generative subset | CHAIR 10.9 | 22 | |
| Discriminative Hallucination Evaluation | AMBER (test) | Accuracy 86.8 | 18 | |
| Generative Hallucination Evaluation | AMBER (test) | CHAIR Score 7.9 | 18 | |
| Discriminative Hallucination Evaluation | AMBER | Accuracy 84.3 | 18 | |
| Watermarking | AMBER | AUC 99.99 | 18 | |
| Generative Hallucination Evaluation | AMBER | Score 90.79 | 14 | |
| Multimodal Watermarking | AMBER | PPL 2.98 | 14 | |
| Hallucination Evaluation (Discriminative) | AMBER-d | Accuracy 89.2 | 12 | |
| Discriminative Hallucination Detection | AMBER | Accuracy 89.4 | 10 | |
| Discriminative Task | AMBER Discrimination 1.0 (test) | Accuracy 76.7 | 10 | |
| Text Fluency Evaluation | AMBER | PPL 112.5 | 9 | |
| Discriminative Hallucination Evaluation | AMBER Discriminative | F1 Score 90.3 | 9 | |