Failure Attribution on Who&When Hand-Crafted
[Chart: Step-level Accuracy and Agent-level Accuracy over time. Highlighted point: Famas, 41.38 Step-level Accuracy, Mar 12, 2026.]
Evaluation Results

Method             Configuration              Date     Step-level Accuracy  Agent-level Accuracy
Famas              Evaluation Mode=Offlin...  2026.03  41.38                62.07
MASC               Evaluation Mode=Online...  2026.03  20.79                -
AgenTracer         Evaluation Mode=Offlin...  2026.03  20.68                63.82
AgenTracer (G)     Evaluation Mode=Offlin...  2026.03  20.68                69.1
PROMAS             Evaluation Mode=Online...  2026.03  19.14                27.66
MASC (G)           Evaluation Mode=Online...  2026.03  18.25                -
Step-by-Step       Evaluation Mode=Offlin...  2026.03  8.77                 53.44
Step-by-Step (G)   Evaluation Mode=Offlin...  2026.03  7.02                 34.48
Binary Search      Evaluation Mode=Offlin...  2026.03  6.9                  36.21
Binary Search (G)  Evaluation Mode=Offlin...  2026.03  6.9                  51.72
All-at-Once (G)    Evaluation Mode=Offlin...  2026.03  5.27                 55.17
Random             Evaluation Mode=N/A        2026.03  4.16                 12
All-at-Once        Evaluation Mode=Offlin...  2026.03  3.51                 53.44