
Mistake math

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Dishonesty Evaluation | Mistake math (test) | Benchmark Dishonesty: 44.16 | 96
Data Ranking | Mistake math | AUROC: 0.79 | 84