| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Hallucination Evaluation | AMBER | CHAIR | 14.2 | 172 |
| Hallucination Assessment | AMBER | CHAIR_s | 10.6 | 56 |
| Hallucination Assessment | AMBER (test) | CHAIR | 5.6 | 38 |
| Generative Hallucination | AMBER Generative | Coverage (%) | 70.4 | 36 |
| Object Hallucination Assessment | AMBER | CHAIR_I | 16.2 | 35 |
| Hallucination Detection | AMBER sampled 5k | A-ROC | 85.99 | 30 |
| Object Hallucination Mitigation on Generative Tasks | AMBER | CHAIR | 12.1 | 22 |
| Multi-modal Hallucination Evaluation | AMBER | Mean Accuracy | 89.79 | 22 |
| Generative Hallucination | AMBER generative subset | CHAIR | 10.9 | 22 |
| Watermarking | AMBER | AUC | 99.99 | 18 |
| Generative Hallucination Evaluation | AMBER | Score | 90.79 | 14 |
| Multimodal Watermarking | AMBER | PPL | 2.98 | 14 |
| Discriminative Hallucination Evaluation | AMBER | Accuracy | 84.3 | 12 |
| Hallucination Evaluation (Generative) | AMBER-g | CHAIR Score | 4.5 | 12 |
| Hallucination Evaluation (Discriminative) | AMBER-d | Accuracy | 89.2 | 12 |
| Discriminative Hallucination Detection | AMBER | Accuracy | 89.4 | 10 |
| Discriminative Task | AMBER Discrimination 1.0 (test) | Accuracy | 76.7 | 10 |
| Text Fluency Evaluation | AMBER | PPL | 112.5 | 9 |
| Discriminative Hallucination Evaluation | AMBER Discriminative | F1 Score | 90.3 | 9 |
| Object Hallucination Detection | AMBER out-of-distribution (OOD) | AUC | 0.8611 | 8 |
| Discriminative Task | AMBER | Accuracy | 84.3 | 4 |
| Next Token Prediction | Amber 1.2T tokens | BPD | 4.28 | 4 |
| Object Hallucination Evaluation | AMBER (test) | Accuracy | 7.28 | 2 |