
Bamboogle

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Question Answering | Bamboogle | EM: 60 | 227 |
| Multi-Hop Question Answering | Bamboogle | Exact Match: 56 | 128 |
| Multi-hop Question Answering | Bamboogle (test) | EM: 57.6 | 98 |
| Question Answering | Bamboogle | EM Accuracy (%): 48 | 68 |
| Multi-hop Question Answering | Bamboogle | Accuracy: 75.2 | 62 |
| Question Answering | Bamboogle | EM: 64.1 | 61 |
| Multi-Hop Question Answering | Bamboogle | Exact Match (EM): 54.4 | 55 |
| Multi-Hop Question Answering | Bamboogle | EM: 78.4 | 51 |
| Question Answering | Bamboogle (test) | EM (%): 53.6 | 47 |
| Multi-Hop QA | Bamboogle | Exact Match (EM): 57.8 | 46 |
| Reasoning | Bamboogle | Accuracy: 73 | 46 |
| Multi-Hop Question Answering | Bamboogle | Accuracy: 47.6 | 44 |
| Error Detection | Bamboogle Full | Precision: 100 | 36 |
| Error Detection | Bamboogle | F1 Score: 0.94 | 36 |
| Multi-Hop Question Answering | Bamboogle (test) | Exact Match (EM): 74.2 | 33 |
| Multi-hop QA | Bamboogle | EM: 56 | 27 |
| Multi-Hop QA | Bamboogle | Accuracy (%): 74.9 | 25 |
| Multi-Hop Question Answering | Bamboogle | F1: 61.6 | 25 |
| Open-domain Question Answering | Bamboogle out-of-domain | F1: 71.7 | 24 |
| Multi-hop Question Answering | Bamboogle online Google Search API (test val) | Exact Match: 68.7 | 24 |
| Multi-hop Question Answering | Bamboogle offline Wiki-18 (test val) | Exact Match (EM): 53.4 | 24 |
| Confidence Calibration in Retrieval-Augmented Generation | Bamboogle k=5 OOD (test) | ECE: 0.065 | 24 |
| Calibration | Bamboogle | ECE: 0.113 | 24 |
| Knowledge-Intensive Reasoning | Bamboogle | F1: 73.8 | 23 |
| Multi-hop Question Answering | Bamboogle | EM: 57.6 | 21 |
Showing 25 of 78 rows