STRING on ShellOps
[Chart: LLM Judge Accuracy. Top result: A3, 49.1 (May 8, 2026)]
Evaluation Results
Method       Configuration               Date      LLM Judge Accuracy
A3           Harness Context=σ-Reveal    2026.05   49.1
A3           Harness Context=Vanilla     2026.05   49
LATS         -                           2026.05   28.8
ReACT        -                           2026.05   28.3
GSPO         -                           2026.05   27.7
GiGPO        -                           2026.05   27
HGPO         -                           2026.05   23.9
RetroAgent   -                           2026.05   20
rStar        -                           2026.05   18.6