
IFEval, EvalPlus, MATH, and GAIA2

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| Attribution Faithfulness Evaluation | IFEval, EvalPlus, MATH, and GAIA2 60 failure cases | Average Tokens to Fix: 1.7 | 3 |