
AMBER

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Hallucination Evaluation | AMBER | CHAIR 24.5 | 222 |
| Generative Hallucination | AMBER Generative | Coverage (%) 70.4 | 81 |
| Hallucination Assessment | AMBER | CHAIR_s 10.6 | 56 |
| Object Hallucination Mitigation on Generative Tasks | AMBER | CHAIR 12.1 | 38 |
| Hallucination Assessment | AMBER (test) | CHAIR 5.6 | 38 |
| Object Hallucination Assessment | AMBER | CHAIR_I 16.2 | 35 |
| Hallucination Detection | AMBER sampled 5k | A-ROC 85.99 | 30 |
| Hallucination Evaluation (Generative) | AMBER-g | CHAIR Score 2.2 | 29 |
| Multi-modal Hallucination Evaluation | AMBER | CHAIR 9.2 | 28 |
| Hallucination Evaluation | AMBER Generative Task | Coverage 67.1 | 26 |
| Action-relation hallucination evaluation | AMBER Relation | Accuracy 81.25 | 25 |
| Discriminative Hallucination Evaluation | AMBER-d | F1 Score 89.5 | 23 |
| Discriminative Object Hallucination | AMBER Discriminative Task | F1 Score 87.4 | 22 |
| Generative Hallucination | AMBER generative subset | CHAIR 10.9 | 22 |
| Discriminative Hallucination Evaluation | AMBER (test) | Accuracy 86.8 | 18 |
| Generative Hallucination Evaluation | AMBER (test) | CHAIR Score 7.9 | 18 |
| Discriminative Hallucination Evaluation | AMBER | Accuracy 84.3 | 18 |
| Watermarking | AMBER | AUC 99.99 | 18 |
| Generative Hallucination Evaluation | AMBER | Score 90.79 | 14 |
| Multimodal Watermarking | AMBER | PPL 2.98 | 14 |
| Hallucination Evaluation (Discriminative) | AMBER-d | Accuracy 89.2 | 12 |
| Discriminative Hallucination Detection | AMBER | Accuracy 89.4 | 10 |
| Discriminative Task | AMBER Discrimination 1.0 (test) | Accuracy 76.7 | 10 |
| Text Fluency Evaluation | AMBER | PPL 112.5 | 9 |
| Discriminative Hallucination Evaluation | AMBER Discriminative | F1 Score 90.3 | 9 |
Showing 25 of 32 rows