Bamboogle

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multi-Hop Question Answering | Bamboogle | Exact Match 48 | 97 |
| Question Answering | Bamboogle | EM 60 | 62 |
| Multi-hop Question Answering | Bamboogle | Accuracy 75.2 | 52 |
| Multi-hop Question Answering | Bamboogle (test) | EM 57.6 | 46 |
| Multi-Hop Question Answering | Bamboogle | EM 32.23 | 37 |
| Error Detection | Bamboogle Full | Precision 100 | 36 |
| Error Detection | Bamboogle | F1 Score 0.94 | 36 |
| Multi-Hop Question Answering | Bamboogle | F1 61.6 | 25 |
| Confidence Calibration in Retrieval-Augmented Generation | Bamboogle k=5 OOD (test) | ECE 0.065 | 24 |
| Calibration | Bamboogle | ECE 0.113 | 24 |
| Question Answering | Bamboogle (test) | EM (%) 35.5 | 18 |
| Knowledge-Intensive Reasoning | Bamboogle | F1 73.8 | 18 |
| Multi-Hop Question Answering | Bamboogle | EM 42.4 | 18 |
| Question Answering | Bamboogle | Cover Exact Match 62.4 | 18 |
| Multi-Hop Question Answering | Bamboogle out-of-domain (val test) | Exact Match (EM) 56.4 | 15 |
| Multi-hop Question Answering | Bamboogle out-of-domain (test) | Accuracy (R) 68.8 | 14 |
| Question Answering | Bamboogle 500 samples (val) | EM 34.6 | 14 |
| Agentic Search | Bamboogle | LJFT Score 64.8 | 12 |
| Compositional multi-hop QA | Bamboogle | Success Rate 77.6 | 12 |
| Multi-Hop Question Answering | Bamboogle (out-of-domain) | Accuracy 73.8 | 10 |
| Question Answering | Bamboogle multi-hop (test) | Avg@16 40.1 | 10 |
| Multi-step Reasoning | Bamboogle auto-eval (test) | Mean Accuracy 76.1 | 10 |
| Multi-hop QA | Bamboogle | EM 56 | 9 |
| Question Answering | Bamboogle | ECE 0.521 | 8 |
| Multi-hop Open-domain Question Answering | Bamboogle | Accuracy 77.3 | 6 |
Showing 25 of 31 rows