Logical Reasoning on BBH Web of Lies
[Chart: Accuracy over time. Top result: 98 (evaluation-instructed prompt optimization), Nov 25, 2025. Updated 1d ago.]
Evaluation Results
| Method | Configuration | Date | Accuracy |
|---|---|---|---|
| evaluation-instructed prompt optimization | Backbone=GPT-4o, Optim... | 2025.11 | 98 |
| LLM only | Backbone=GPT-4o, Optim... | 2025.11 | 96 |
| Pro-Refine | Backbone=GPT-4o, Optim... | 2025.11 | 96 |
| TextGrad | Backbone=GPT-4o, Optim... | 2025.11 | 96 |
| APE | Backbone=GPT-4o, Optim... | 2025.11 | 95 |
| Self-Refine | Backbone=GPT-4o, Optim... | 2025.11 | 94 |
| TextGrad | Backbone=LLaMA-3, Opti... | 2025.11 | 74 |
| TextGrad | Backbone=LLaMA-3.1, Op... | 2025.11 | 73 |
| APE | Backbone=LLaMA-3, Opti... | 2025.11 | 72 |
| Pro-Refine | Backbone=LLaMA-3, Opti... | 2025.11 | 71 |
| APE | Backbone=LLaMA-3.1, Op... | 2025.11 | 71 |
| Pro-Refine | Backbone=LLaMA-3.1, Op... | 2025.11 | 70 |
| evaluation-instructed prompt optimization | Backbone=LLaMA-3, Opti... | 2025.11 | 69 |
| LLM only | Backbone=LLaMA-3.1, Op... | 2025.11 | 69 |
| evaluation-instructed prompt optimization | Backbone=LLaMA-3.1, Op... | 2025.11 | 68 |
| Self-Refine | Backbone=LLaMA-3, Opti... | 2025.11 | 67 |
| LLM only | Backbone=LLaMA-3, Opti... | 2025.11 | 66 |
| Self-Refine | Backbone=LLaMA-3.1, Op... | 2025.11 | 66 |