Agreement with outcome human labels on CUAVerifierBench Internal Dataset
[Chart: Accuracy, Universal Verifier (UV): 81 (Apr 5, 2026). Selectable metrics: Accuracy, F1 Score, Cohen’s Kappa, False Negative Rate (FNR), False Positive Rate (FPR). Updated 9d ago.]
Evaluation Results
| Method | Configuration | Date | Accuracy | F1 Score | Cohen’s Kappa | False Negative Rate (FNR) | False Positive Rate (FPR) |
|---|---|---|---|---|---|---|---|
| Universal Verifier (UV) | Verifier=UV, Backbone=... | 2026.04 | 81 | 81 | 64 | 32 | 1 |
| Universal Verifier | Verifier=UV, Base Mode... | 2026.04 | 81 | 81 | 64 | 32 | 1 |
| WebJudge | Verifier=WebJudge, Bac... | 2026.04 | 72 | 74 | 44 | 33 | 22 |
| WebJudge | Verifier=WebJudge, Bas... | 2026.04 | 72 | 74 | 44 | 33 | 22 |
| WebVoy. | Verifier=WebVoy., Back... | 2026.04 | 70 | 69 | 43 | 44 | 10 |
| WebVoy. | Verifier=WebVoy., Back... | 2026.04 | 67 | 73 | 31 | 24 | 45 |
| WebVoyager | Verifier=WebVoy., Base... | 2026.04 | 67 | 73 | 31 | 24 | 45 |
| WebJudge | Verifier=WebJudge, Bac... | 2026.04 | 64 | 58 | 33 | 57 | 7 |
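
For readers unfamiliar with the five reported metrics, the following is a minimal sketch of how they are conventionally computed when a verifier's binary verdicts are compared against human outcome labels. It is not the benchmark's own evaluation code. The function name `agreement_metrics`, the 0/1 label encoding, and the choice of "success" as the positive class are assumptions made for illustration; the leaderboard may use a different F1 averaging scheme or label convention.

```python
def agreement_metrics(human, verifier):
    """Compare verifier verdicts to human labels.

    Both inputs are equal-length sequences of 0/1 labels, where 1 is
    assumed to mean "task succeeded" (hypothetical encoding).
    Returns fractions in [0, 1]; the leaderboard appears to show values x100.
    """
    if len(human) != len(verifier) or len(human) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(human, verifier))
    tp = sum(1 for h, v in pairs if h == 1 and v == 1)
    tn = sum(1 for h, v in pairs if h == 0 and v == 0)
    fp = sum(1 for h, v in pairs if h == 0 and v == 1)
    fn = sum(1 for h, v in pairs if h == 1 and v == 0)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n

    # F1 on the positive class (an assumption; macro-F1 is also common).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = ((p_observed - p_expected) / (1 - p_expected)
             if p_expected != 1 else 0.0)

    # Error rates relative to the human label.
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # human says success, verifier says failure
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # human says failure, verifier says success

    return {"accuracy": accuracy, "f1": f1, "kappa": kappa, "fnr": fnr, "fpr": fpr}


if __name__ == "__main__":
    # Toy labels, not benchmark data.
    human = [1, 1, 1, 0, 0, 0, 1, 0]
    verifier = [1, 1, 0, 0, 0, 1, 1, 0]
    print(agreement_metrics(human, verifier))
```

Under these definitions, a verifier with a low FPR and a high FNR rarely approves a failed trajectory but rejects a sizeable share of successful ones.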