| Dataset Name | SOTA Method | Metric | Trend | Updated | |
|---|---|---|---|---|---|
| FaithCoT-Bench | GeoFaith | F1 Score 61.7 | 10 | 7d ago | |
| ProcessBench | | F1 Score 83.2 | 10 | 7d ago | |
| FCGPT | GeoFaith | Accuracy 93.5 | 10 | 7d ago | |
| RAGTruth | GeoFaith | Accuracy 90.3 | 10 | 7d ago | |
| In-domain Step-level Benchmark Agent | GeoFaith | FF1 80.2 | 10 | 7d ago | |
| In-domain Step-level Benchmark Knowledge | GeoFaith | FF1 83.4 | 10 | 7d ago | |
| In-domain Step-level Benchmark Reasoning | GeoFaith | FF1 84.5 | 10 | 7d ago | |
| Step-level Benchmark In-domain Math | GeoFaith | FF1 84.2 | 10 | 7d ago | |
| FEVER n=200 | | M41 | 6 | 1mo ago | |
| LatentAudit Mistral-7B (evaluation) | GPT-4o Judge | AUROC 94 | 6 | 1mo ago | |
| LatentAudit Qwen-2.5-7B (evaluation) | GPT-4o Judge | AUROC 94.5 | 6 | 1mo ago | |
| LatentAudit Llama-3-8B (evaluation set) | GPT-4o Judge | AUROC 0.948 | 6 | 1mo ago |