Error category prediction on TRAIL Planning and Reasoning categories (117 traces)
Top result: GPT-5 with 49.7 Micro F1 (May 21, 2026).
Metrics reported: Micro F1 and Macro Cat F1.
Evaluation Results
| Method | Configuration | Date | Micro F1 | Macro Cat F1 |
| --- | --- | --- | --- | --- |
| GPT-5 | mapping_strategy=full+... | 2026.05 | 49.7 | 45.9 |
| GPT-5 | mapping_strategy=full,... | 2026.05 | 46.7 | 36.8 |
| Always top-4 | baseline_type=top-4 mo... | 2026.05 | 45.9 | 19.9 |
| OSS-120B | mapping_strategy=full+... | 2026.05 | 42.7 | 37.4 |
| OSS-120B | mapping_strategy=full,... | 2026.05 | 37.7 | 26.1 |
| Random (GT freq) | baseline_type=random p... | 2026.05 | 34.2 | 28.8 |
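
The two metric columns can diverge sharply: the Always top-4 baseline reaches 45.9 Micro F1 but only 19.9 Macro Cat F1. A minimal sketch of why this can happen is given below, assuming the standard multi-label definitions (micro F1 pools true/false positives over all categories, macro F1 averages per-category F1 with equal weight). The benchmark's exact scoring code is not shown on this page, so the function names, the toy categories, and the convention that a category with no counts scores 0 are all illustrative assumptions.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there are no counts (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred, categories):
    """Compute (micro F1, macro F1) for multi-label category predictions.

    gold, pred: lists of sets of category labels, one set per trace.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for c in categories:
            if c in g and c in p:
                tp[c] += 1
            elif c in p:
                fp[c] += 1
            elif c in g:
                fn[c] += 1
    # Micro: pool counts across categories, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per category, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in categories) / len(categories)
    return micro, macro


# Hypothetical toy data (not taken from TRAIL): category "A" is frequent.
categories = ["A", "B", "C", "D"]
gold = [{"A"}, {"A", "B"}, {"A"}, {"C"}, {"D"}]
always_frequent = [{"A"}] * len(gold)  # baseline that always predicts "A"

micro, macro = micro_macro_f1(gold, always_frequent, categories)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy setup the frequent-category baseline is rewarded by micro F1 because most gold labels are "A", while macro F1 penalizes it for scoring zero on every rare category. This mirrors the pattern in the Always top-4 row, although the real benchmark may compute its scores differently.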