Reasoning on BigBench Hard Penguins
Best result: 44.1 Accuracy (ReElicit)
[Chart: Accuracy by date; data point at May 18, 2026]
Updated 14d ago
Evaluation Results
Method          Details                    Date      Accuracy
ReElicit        evaluations=30 prompt...   2026.05   44.1
OPRO            evaluations=30 prompt...   2026.05   43.9
APE             evaluations=30 prompt...   2026.05   43.4
TextGrad        evaluations=30 prompt...   2026.05   33.1
PromptBreeder   evaluations=30 prompt...   2026.05   29.3