
AMBER

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Hallucination Evaluation | AMBER | CHAIR 14.2 | 172 |
| Hallucination Assessment | AMBER | CHAIR_s 10.6 | 56 |
| Hallucination Assessment | AMBER (test) | CHAIR 5.6 | 38 |
| Generative Hallucination | AMBER Generative | Coverage (%) 70.4 | 36 |
| Object Hallucination Assessment | AMBER | CHAIR_I 16.2 | 35 |
| Hallucination Detection | AMBER sampled 5k | A-ROC 85.99 | 30 |
| Object Hallucination Mitigation on Generative Tasks | AMBER | CHAIR 12.1 | 22 |
| Multi-modal Hallucination Evaluation | AMBER | Mean Accuracy 89.79 | 22 |
| Generative Hallucination | AMBER generative subset | CHAIR 10.9 | 22 |
| Watermarking | AMBER | AUC 99.99 | 18 |
| Generative Hallucination Evaluation | AMBER | Score 90.79 | 14 |
| Multimodal Watermarking | AMBER | PPL 2.98 | 14 |
| Discriminative Hallucination Evaluation | AMBER | Accuracy 84.3 | 12 |
| Hallucination Evaluation (Generative) | AMBER-g | CHAIR Score 4.5 | 12 |
| Hallucination Evaluation (Discriminative) | AMBER-d | Accuracy 89.2 | 12 |
| Discriminative Hallucination Detection | AMBER | Accuracy 89.4 | 10 |
| Discriminative Task | AMBER Discrimination 1.0 (test) | Accuracy 76.7 | 10 |
| Text Fluency Evaluation | AMBER | PPL 112.5 | 9 |
| Discriminative Hallucination Evaluation | AMBER Discriminative | F1 Score 90.3 | 9 |
| Object Hallucination Detection | AMBER out-of-distribution (OOD) | AUC 0.8611 | 8 |
| Discriminative Task | AMBER | Accuracy 84.3 | 4 |
| Next Token Prediction | Amber 1.2T tokens | BPD 4.28 | 4 |
| Object Hallucination Evaluation | AMBER (test) | Accuracy 7.28 | 2 |