Benchmark-500

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Prefill-stage hallucination risk detection | Benchmark-500 Relaxed Consensus (Pvote ≥ 0.8) | AUROC (Mean): 0.6957 | 4 |
| Prefill-stage hallucination risk detection | Benchmark-500 Strict Consensus Pvote = 1.0 vs. Clean | AUROC (Mean): 0.6939 | 4 |