Reward Prediction on Human preference annotation (test)
[Chart: Effort, Evidence, Grounding, Accuracy; Jan 23, 2026]
Updated 1mo ago
Evaluation Results
Method             Checkpoint  Date     Effort  Evidence  Grounding  Accuracy
Qwen2.5-7B-Instr.  Original    2026.01  30      26        28         28
GPT-4.1            Zero Shot   2026.01  44      22        30         32
gpt-oss-20b        SFT         2026.01  44      32        35         37
GPT-4.1            SFT         2026.01  52      25        31         36
GPT-5              Zero Shot   2026.01  56      54        49         53
Gemini 2.5 Flash   Zero Shot   2026.01  57      25        29         37
Gemini 2.5 Flash   SFT         2026.01  61      53        45         53
IntelliReward      -           2026.01  70      76        70         72
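
For readers who want to work with these numbers programmatically, the following is a minimal Python sketch that holds the results as tuples and ranks the methods by Accuracy. The names `ROWS`, `mean_of_components`, and `rank_by_accuracy` are hypothetical helpers introduced only for illustration. In the listed rows, the Accuracy value happens to equal the mean of Effort, Evidence, and Grounding; whether Accuracy is actually defined that way is an assumption, since the page does not say.

```python
# Rows copied from the Evaluation Results above:
# (method, checkpoint, effort, evidence, grounding, accuracy)
ROWS = [
    ("Qwen2.5-7B-Instr.", "Original", 30, 26, 28, 28),
    ("GPT-4.1", "Zero Shot", 44, 22, 30, 32),
    ("gpt-oss-20b", "SFT", 44, 32, 35, 37),
    ("GPT-4.1", "SFT", 52, 25, 31, 36),
    ("GPT-5", "Zero Shot", 56, 54, 49, 53),
    ("Gemini 2.5 Flash", "Zero Shot", 57, 25, 29, 37),
    ("Gemini 2.5 Flash", "SFT", 61, 53, 45, 53),
    ("IntelliReward", None, 70, 76, 70, 72),  # no checkpoint listed
]


def mean_of_components(row):
    """Average of the Effort, Evidence, and Grounding scores for one row."""
    _, _, effort, evidence, grounding, _ = row
    return (effort + evidence + grounding) / 3


def rank_by_accuracy(rows):
    """Return the rows sorted from highest to lowest Accuracy."""
    return sorted(rows, key=lambda row: row[5], reverse=True)


if __name__ == "__main__":
    for method, checkpoint, *scores in rank_by_accuracy(ROWS):
        label = method if checkpoint is None else f"{method} ({checkpoint})"
        # Assumption being checked: Accuracy coincides with the component mean.
        matches = mean_of_components((method, checkpoint, *scores)) == scores[3]
        print(f"{label}: accuracy={scores[3]}, equals component mean: {matches}")
```

Because `sorted` is stable, rows with equal Accuracy (for example GPT-5 Zero Shot and Gemini 2.5 Flash SFT, both at 53) keep their original relative order.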