| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Hallucination Evaluation | AMBER | F1 Score 90.9 | 71 |
| Hallucination Assessment | AMBER | CHAIR_s 10.6 | 47 |
| Hallucination Assessment | AMBER (test) | CHAIR 5.6 | 38 |
| Hallucination Detection | AMBER sampled 5k | A-ROC 85.99 | 30 |
| Generative Hallucination | AMBER Generative | CHAIR Score 8.4 | 24 |
| Generative Hallucination | AMBER generative subset | CHAIR 10.9 | 22 |
| Watermarking | AMBER | AUC 99.99 | 18 |
| Multimodal Watermarking | AMBER | PPL 2.98 | 14 |
| Multi-modal Hallucination Evaluation | AMBER | Mean Accuracy 76.9 | 10 |
| Discriminative Task | AMBER Discrimination 1.0 (test) | Accuracy 76.7 | 10 |
| Next Token Prediction | Amber 1.2T tokens | BPD 4.28 | 4 |