Benchmark-500

Benchmarks

Task Name	Dataset Name	SOTA Result	Trend
Prefill-stage hallucination risk detection	Benchmark-500 Relaxed Consensus (Pvote ≥ 0.8)	AUROC (Mean): 0.6957		4
Prefill-stage hallucination risk detection	Benchmark-500 Strict Consensus (Pvote = 1.0) vs. Clean	AUROC (Mean): 0.6939		4
