Bamboogle

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Multi-Hop Question Answering | Bamboogle | Exact Match: 56 | 128 |
| Question Answering | Bamboogle | EM: 60 | 120 |
| Multi-hop Question Answering | Bamboogle (test) | EM: 57.6 | 84 |
| Multi-hop Question Answering | Bamboogle | Accuracy: 75.2 | 62 |
| Multi-Hop Question Answering | Bamboogle | EM: 78.4 | 51 |
| Reasoning | Bamboogle | Accuracy: 73 | 46 |
| Question Answering | Bamboogle | EM Accuracy (%): 48 | 45 |
| Error Detection | Bamboogle Full | Precision: 100 | 36 |
| Error Detection | Bamboogle | F1 Score: 0.94 | 36 |
| Multi-Hop Question Answering | Bamboogle (test) | Exact Match (EM): 74.2 | 33 |
| Multi-hop QA | Bamboogle | EM: 56 | 27 |
| Multi-Hop QA | Bamboogle | Accuracy (%): 74.9 | 25 |
| Multi-Hop Question Answering | Bamboogle | F1: 61.6 | 25 |
| Confidence Calibration in Retrieval-Augmented Generation | Bamboogle k=5 OOD (test) | ECE: 0.065 | 24 |
| Calibration | Bamboogle | ECE: 0.113 | 24 |
| Question Answering | Bamboogle (test) | EM (%): 53.6 | 21 |
| Multi-hop Question Answering | Bamboogle standard (val) | Exact Match (EM): 40 | 20 |
| Multi-Hop Question Answering | Bamboogle (out-of-domain) | Accuracy: 73.8 | 19 |
| Knowledge-Intensive Reasoning | Bamboogle | F1: 73.8 | 18 |
| Multi-Hop Question Answering | Bamboogle | EM: 42.4 | 18 |
| Question Answering | Bamboogle | Cover Exact Match: 62.4 | 18 |
| Multi-hop Question Answering | Bamboogle (dev test) | F1 Score: 68.2 | 17 |
| Multi-Hop Question Answering | Bamboogle | Score: 50 | 16 |
| Multi-Hop Question Answering | Bamboogle out-of-domain (val test) | Exact Match (EM): 56.4 | 15 |
| Agentic Search | Bamboogle | String-F1 Score: 73.1 | 14 |
Showing 25 of 47 rows