
Error category prediction on TRAIL Planning and Reasoning categories (117 traces)

49.7 Micro F1

GPT-5

May 21, 2026
Updated 12d ago

Evaluation Results

Method  Links
2026.05  49.7  45.9
2026.05  46.7  36.8
2026.05  45.9  19.9
2026.05  42.7  37.4
2026.05  37.7  26.1
2026.05  34.2  28.8