Sentence-level error detection on DeltaBench CoT Diagnosis 1.0 (test)
[Chart: Precision, Recall, and F1 Score by method. Highlighted point: GPT-5 (BIG-Bench Prompt), Precision 43.2, Mar 22, 2026.]
Updated 25d ago
Evaluation Results
Method                     Configuration              Date     Precision  Recall  F1 Score
GPT-5 (BIG-Bench Prompt)   Base Model=GPT-5, Prom...  2026.03  43.2       65.8    47
ReasonDiag                 Base Model=GPT-5           2026.03  30.6       80.1    38.6
GPT-5 (DeltaBench Prompt)  Base Model=GPT-5, Prom...  2026.03  5.1        4.1     4.4