Matching Human Consensus Feedback on Human Consensus Feedback
[Chart: Precision by method. Highlighted point: GOODPOINT-SFT, Precision 13.8, Apr 13, 2026. Selectable metrics: Precision, Recall, F1 Score.]
Updated 4d ago
Evaluation Results
Method                 Method details             Links    Precision  Recall  F1 Score
GOODPOINT-SFT          Bootstrap samples (B)=...  2026.04  13.8       11.2    10.8
GPT-5.2                Bootstrap samples (B)=...  2026.04  13         16.5    13
Gemini-3-flash         Bootstrap samples (B)=...  2026.04  12.8       16.9    13.1
GOODPOINT-DPO          Bootstrap samples (B)=...  2026.04  9.3        10.7    8.7
Qwen3-8b (Base)        Bootstrap samples (B)=...  2026.04  6.9        8.4     6.8
Llama3.1-8b-Instruct   Bootstrap samples (B)=...  2026.04  4.7        5.3     4.4
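
A note on reading the F1 Score column: in every row it is lower than the harmonic mean of the listed Precision and Recall. For GOODPOINT-SFT, for example, the harmonic mean of 13.8 and 11.2 is about 12.4, while the reported F1 is 10.8. One plausible explanation, which is an assumption and is not stated on this page, is that F1 is computed per item and then averaged rather than derived from the aggregate Precision and Recall. The short sketch below only reproduces the arithmetic from the values listed above; the function name `f1_from_pr` and the `rows` dictionary are illustrative, not part of the benchmark.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (Precision, Recall, reported F1) copied from the results listed above.
rows = {
    "GOODPOINT-SFT": (13.8, 11.2, 10.8),
    "GPT-5.2": (13.0, 16.5, 13.0),
    "Gemini-3-flash": (12.8, 16.9, 13.1),
    "GOODPOINT-DPO": (9.3, 10.7, 8.7),
    "Qwen3-8b (Base)": (6.9, 8.4, 6.8),
    "Llama3.1-8b-Instruct": (4.7, 5.3, 4.4),
}

for name, (p, r, reported) in rows.items():
    # Compare the harmonic mean of the aggregate columns with the reported F1.
    harmonic = f1_from_pr(p, r)
    print(f"{name:22s} harmonic={harmonic:5.2f} reported={reported:5.2f}")
```

Because the two figures differ, the F1 column should be read as its own reported measurement and not recomputed from the other two columns.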