Multi-step Reasoning on Bamboogle auto-eval (test)
[Chart: Mean Accuracy and Standard Deviation by method. Best result: 76.1 Mean Accuracy (Self-improvement, 2nd gen), Dec 15, 2023.]
Updated 4d ago
Evaluation Results
| Method | Model Size | Date | Mean Accuracy | Standard Deviation |
|---|---|---|---|---|
| Self-improvement, 2nd gen | L | 2023.12 | 76.1 | 1.3 |
| Self-improvement, 1st gen | L | 2023.12 | 74 | 3.3 |
| Pilot, human filtered | L | 2023.12 | 71.5 | 2.2 |
| Pre-trained | L | 2023.12 | 70.3 | 3.5 |
| Self-improvement, 2nd gen | S | 2023.12 | 69.7 | 1.3 |
| Self-improvement, 2nd gen | XS | 2023.12 | 65.9 | 2.6 |
| Self-improvement, 1st gen | S | 2023.12 | 61.9 | 1.9 |
| Pilot, human filtered | S | 2023.12 | 56.6 | 3.8 |
| Self-improvement, 1st gen | XS | 2023.12 | 54.4 | 3.6 |
| Pilot, human filtered | XS | 2023.12 | 44.7 | 3.1 |