Policy Optimization on RLHF
[Chart: True Score, Proxy Score, and Worst Case Value by method; top True Score 8.3 (ORPO)]
Evaluation Results
| Method | Environment | Date | True Score | Proxy Score | Worst Case Value |
|---|---|---|---|---|---|
| ORPO | RLHF | 2026.04 | 8.3 | 0.63 | -1.84 |
| Max-Min | RLHF | 2026.04 | 5.38 | 0.84 | -0.1 |
| Ensemble | RLHF | 2026.04 | 2.31 | 1.26 | -1.7 |